JP2005208852A

JP2005208852A - Summary registering apparatus, summary registration method and program

Info

Publication number: JP2005208852A
Application number: JP2004013614A
Authority: JP
Inventors: Takashi Osawa; 隆大澤
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2004-01-21
Filing date: 2004-01-21
Publication date: 2005-08-04

Abstract

PROBLEM TO BE SOLVED: To efficiently generate a summary that reflects the retrieval intention of a user.

SOLUTION: A summary group storage part 20 stores summary data of document data that is returned as the retrieval result when document data related to a designated index word is retrieved from a document data group being a retrieval object. A registration part 10 registers the summary data of document data being a new retrieval object in the summary group storage part 20. To do so, the registration part 10 divides the document data being the new retrieval object into phrase units, and extracts index words from each phrase included in the phrase group obtained by the division. The registration part 10 then preferentially selects, from the phrase group, a phrase including an index word which does not overlap with those of the other phrases, and registers the phrase in the summary group storage part 20 as the summary data of the new document data.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a summary registration device that registers summary data of new document data in a summary storage device. The summary storage device stores summary data of document data that is returned as the search result when document data related to a designated index word is searched for in a search target document data group.

When using a large-scale search system such as an Internet search engine, users often rely on the summary data included in the search results, such as summary sentences or sample sentences. By judging from the summary data whether a retrieved document matches the search intention, the user no longer needs to fetch and browse each retrieved document, which saves time in finding the intended document data.

In general, summary generation is broadly divided into two types. One extracts the important parts from the full text data in the document data and uses the extracted parts as summary data (static summary). The other stores the full text data of the document data and uses the data containing the keywords specified in the search condition as summary data (dynamic summary).

However, a static summary displays fixed information regardless of the keywords the user entered, so it does not necessarily reflect the user's search intention. It may therefore be difficult to reduce the burden on the user of judging whether a document matches the search intention. To address this, dynamic summarization techniques, which dynamically display the information in a document related to the keywords entered by the user, have been researched and developed. For example, typical search engines such as Google generate summaries dynamically using a method called KWIC (Keyword in Context). A technique for extracting from a document the portions most relevant to a search sentence, as in Patent Document 1, has also been studied. Because a dynamic summary is generated from the user's search words or search sentence, it can reflect the user's search intention. However, when the full text data is stored for summary generation, the database grows as the amount of searchable information increases, and search time and summary generation time may become longer.

Patent Document 1: Japanese Patent Laid-Open No. 2001-167096

An object of the present invention is to efficiently generate a summary data group that covers the index words included in document data.

The summary registration device according to the present invention registers summary data of new search target document data in a summary storage device that stores summary data of document data returned as the search result when document data related to a designated index word is searched for in a search target document data group. The device comprises: dividing means for dividing the new search target document data into predetermined units; index word extracting means for extracting index words from each piece of divided data in the divided data group obtained by the division; and registering means for preferentially selecting, from the divided data group, divided data that includes an index word not overlapping with the index words included in the other divided data, and registering it in the summary storage device as summary data of the new document data.

According to the present invention, the registering means preferentially selects, from the divided data group obtained when the dividing means divides the search target document data, divided data that includes an index word not overlapping with the index words included in the other divided data, and registers it in the summary storage device as summary data of the new document data. It is thus possible to limit the size and number of summary data stored per document and keep the capacity of the summary storage device small, while preferentially registering summary data that covers as many of the index words in the document data as possible. This suppresses, for example, the situation in which a document is related to the index word specified in the search condition, yet none of the summary data registered in the summary storage device contains that index word, so that no summary data for the document can be returned to the user.

According to one aspect of the summary registration device of the present invention, the registering means preferentially selects, from the divided data group, divided data that includes an index word not overlapping with the index words included in the other divided data and that includes more index words than the other divided data, and registers it in the summary storage device as summary data of the new document data.

According to this aspect, divided data including more index words can be preferentially registered in the summary storage device as summary data. For example, when a user searches with many index words specified as the search condition, this raises the likelihood that summary data containing many of those index words can be returned.

An embodiment of the present invention (hereinafter, the embodiment) will be described with reference to the drawings.

FIG. 1 is a functional block diagram of the document search device according to the embodiment. The document search device includes a registration unit 10 that registers summary data of collected document data in a summary group storage unit 20; the summary group storage unit 20, which stores the summary data registered by the registration unit 10; and a search unit 30 that extracts, from the summary data group stored in the summary group storage unit 20, summary data related to the document data retrieved on the basis of an index word (keyword) specified by the user. In the embodiment, document data means mainly text data composed of character codes defined by the JIS Kanji code, the ASCII code, or the like.

The document search device is, for example, connected to the Internet and used to search Web document data held by Web servers on the Internet. It is also used, when connected to a corporate intranet, to search document data such as equipment manuals and office documents held by file servers on that intranet.

The document search device configured in this way extracts, from the collected document data group, summary data showing part of the content of the document data related to the index word specified by the user, and returns that summary data to the user. By referring to the returned summary data, the user can judge whether the search results returned by the document search device reflect his or her search intention, without reading the entire content of the related document data.

One method of generating the summary data returned as a search result is summary generation based on KWIC (Keyword in Context). This method stores all of the collected document data and, for each query, generates and returns summary data containing the index words of the search condition; it is widely used for summary display in WWW search engines. However, the capacity of the database that stores the collected document data is physically finite, so the amount of document data that can be stored is limited. Moreover, as the database grows with the amount of document data, search time and summary generation time become longer.
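For orientation, the KWIC style of dynamic summary mentioned above can be sketched as follows. This is only an illustrative approximation, not the method of any particular search engine: the function name `kwic_snippets`, the `width` parameter, and the whitespace tokenization are assumptions made for the example. Note that it needs the full text of the document at query time, which is the storage cost the embodiment aims to avoid.

```python
def kwic_snippets(text, keyword, width=3):
    """Return each occurrence of `keyword` with `width` words of context
    on either side (hypothetical helper, whitespace tokenization)."""
    words = text.split()
    snippets = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring trailing punctuation.
        if word.lower().strip(".,;:!?") == keyword.lower():
            start = max(0, i - width)
            snippets.append(" ".join(words[start:i + width + 1]))
    return snippets


example = kwic_snippets(
    "The printer jams when the toner is low and the paper is damp.",
    "toner",
    width=2,
)
```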

In particular, when only WWW content is searched, the full text of one piece of content is about several kilobytes. When general office documents are searched, however, the full text data of one office document is often ten or more times that size. With a method such as KWIC that stores all document data, the number of office documents that can be stored in the database may therefore be small.

To solve this problem, one conceivable approach is to cut out some of the phrases from the collected document data in advance as summary data and store in the database their relationship to the words extracted by morphological analysis. This limits the size and number of summary data stored per document, keeps the database capacity small, and prevents search time from increasing. With this approach, however, the document data is not stored in full, so not every index word in the document data is necessarily contained in the phrases that were cut out. As a result, a document may be related to the index word specified in the search condition, yet the summary data registered in the database does not contain that index word, and no summary data for the document can be returned to the user. This becomes more likely as the amount of document data grows. When summary data containing the index word cannot be returned, the user cannot judge whether a retrieved document is a suitable one containing the index word without actually reading its content, which increases the burden on the user.

The embodiment therefore provides a document search device that generates summary data covering as many of the index words in the document data as possible, while limiting the size and number of summary data stored per document and keeping the database capacity small.

The registration unit 10, which cuts out part of the document data and registers it as summary data in the summary group storage unit 20, will now be described in detail with reference to the functional block diagram in FIG. 2.

The document data collection unit 11 is a module that sequentially collects document data, by robot search or the like, from document management devices that hold document data, such as Web servers and file servers on the network to which the document search device is connected. The phrase/index word table creation unit 12 is a module that creates a phrase/index word table, as shown in FIG. 3, from the collected document data; the creation of this table is described in detail later. The registered phrase extraction unit 13 is a module that extracts the phrases to be registered as summary data on the basis of the phrase/index word table and registers them in the summary group storage unit 20.

The method by which the phrase/index word table creation unit 12 creates the phrase/index word table is as follows. The unit first divides the collected document data into a plurality of phrases at punctuation marks, and assigns a unique phrase ID to each resulting phrase in order from the beginning of the document. Instead of dividing at punctuation marks, the document data may be divided in other units, for example every fixed number of words.

After the division, the phrase/index word table creation unit 12 extracts the index words included in each phrase by morphological analysis. From the results, it creates a phrase/index word table as shown in FIG. 3. In the table of FIG. 3, the entry for phrase ID 0 indicates that the phrase contains "W1" and "W3" as index words.
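The table creation described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the punctuation set is chosen for the example, and the morphological analysis of the embodiment is replaced here by a much simpler stand-in (lowercased word tokens with a small stopword list), which a real implementation for Japanese text would not use.

```python
import re

# Hypothetical stopword list standing in for the filtering that a
# morphological analyzer would perform.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "is", "in", "to"})


def split_into_phrases(text):
    """Split text at punctuation marks; list position serves as phrase ID."""
    parts = re.split(r"[.,;:!?。、]+", text)
    return [p.strip() for p in parts if p.strip()]


def extract_index_words(phrase):
    """Stand-in for morphological analysis: word tokens minus stopwords."""
    words = re.findall(r"[A-Za-z0-9]+", phrase.lower())
    return {w for w in words if w not in STOPWORDS}


def build_phrase_index_table(text):
    """Return {phrase_id: (phrase, set_of_index_words)}."""
    phrases = split_into_phrases(text)
    return {pid: (p, extract_index_words(p)) for pid, p in enumerate(phrases)}


table = build_phrase_index_table(
    "The printer jams, and the toner is low. Replace the toner."
)
```

Each entry of `table` plays the role of one row of the phrase/index word table: the phrase ID, the phrase text, and the index words found in it.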

Next, the method by which the registered phrase extraction unit 13 registers summary data is described. First, an initial vector VO (written with the letter O, to distinguish it from the vector V0 of phrase ID 0) is defined as (W0, W1, W2, W3, W4, ..., Wx) = (0, 0, 0, 0, 0, ..., 0). The number of zeros equals the number of index words extracted from the phrase group obtained by dividing the document data. Each phrase is likewise expressed as a vector. For example, the vector V0 of phrase ID 0 is (0, 1, 0, 1, 0, ..., 0). That is, in a phrase's vector, only the elements at the positions corresponding to the index words the phrase contains are set to 1.

The registered phrase extraction unit 13 first takes the element-wise logical OR of the initial vector VO and the vector V0 of phrase ID 0. The resulting result vector V is (0, 1, 0, 1, 0, ..., 0). It then takes the logical OR of each element of this result vector V with the corresponding element of the vector V1 of phrase ID 1, which gives the result vector V = (0, 1, 0, 1, 1, ..., 0).

In this way, while incrementing the phrase ID, the unit takes the logical OR of each element of the result vector Vold obtained from the previous OR with the corresponding element of the current phrase's vector. If any element of the new result vector Vnew has changed from 0 to 1, that phrase is made a registration target phrase. An element changing from 0 to 1 means that the current phrase contains an index word that none of the previously processed phrases contained, so the phrase is preferentially registered in the summary group storage unit 20 as summary data.

In the example above, OR-ing in phrase ID 0 and OR-ing in phrase ID 1 each change an element of the result vector V from 0 to 1, so both become registration target phrases. Suppose, on the other hand, that OR-ing in the vector V2 of phrase ID 2 leaves the result vector V at (0, 1, 0, 1, 1, ..., 0), with no element changed. This means that phrase ID 2 contains only index words already contained in phrase ID 0 or phrase ID 1, which are already registration target phrases. Therefore, even if phrase ID 2 is not registered as summary data, other phrases containing its index words are registered. For those index words, the situation in which a document is related to the specified index word yet no summary data registered in the summary group storage unit 20 contains it, so that no summary data can be returned to the user, does not arise.
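The vector walk-through above can be reproduced in a few lines. This is a sketch only: the helper names are hypothetical, the vocabulary is truncated to W0 through W4, and the index words assumed for phrase IDs 1 and 2 are illustrative choices that are merely consistent with the text (FIG. 3 is not reproduced here).

```python
def to_vector(index_words, vocabulary):
    """Binary vector with 1 at the position of each index word present."""
    return [1 if w in index_words else 0 for w in vocabulary]


def logical_or(a, b):
    """Element-wise logical OR of two equal-length binary vectors."""
    return [x | y for x, y in zip(a, b)]


vocabulary = ["W0", "W1", "W2", "W3", "W4"]
# Phrase ID 0 follows the text; IDs 1 and 2 are assumed for illustration.
phrases = {0: {"W1", "W3"}, 1: {"W4"}, 2: {"W1", "W4"}}

result = [0] * len(vocabulary)  # initial vector VO
targets = []                    # phrase IDs chosen as registration targets
for phrase_id in sorted(phrases):
    new_result = logical_or(result, to_vector(phrases[phrase_id], vocabulary))
    if new_result != result:    # some element changed from 0 to 1
        targets.append(phrase_id)
    result = new_result
```

With these inputs, phrase IDs 0 and 1 are selected and phrase ID 2 is skipped, because it adds no index word that the result vector does not already cover.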

In this way, the document data may be divided into a plurality of phrases, the phrases containing index words not contained in other phrases may be preferentially extracted from the phrase group as registration target phrases, and all of the extracted registration target phrases may be registered in the summary group storage unit 20 as summary data of that document data.

However, if many registration target phrases are extracted, they strain the capacity of the summary group storage unit 20, and it may become impossible to register summary data for many documents. The registered phrase extraction unit 13 of the embodiment therefore limits the number of summary data that can be registered for one document. That is, it ORs the phrases into the result vector V (initially the initial vector VO) in order from phrase ID 0, and when the number of registration target phrases exceeds a predetermined reference value, it stops the summary data registration for that document data and proceeds to the summary data registration process for another document.

If the number of phrases selected as summary candidates does not exceed the predetermined reference value, the unit could simply register those registration target phrases in the summary group storage unit 20 and finish. In the embodiment, however, in order to register as much summary data as possible for each document, the phrases that were not selected as registration target phrases are additionally taken in ascending order of phrase ID and registered as summary data of that document until the reference value is reached.

Using the summary data registered in this way, the search unit 30 returns search results to the user as follows. The search unit 30 receives the index word specified by the user as the search condition and searches for document data related to that index word. If document data matching the search condition exists, the search unit 30 uses the identification information of that document data to extract from the summary group storage unit 20 at least one piece of summary data that corresponds to the document data and contains the specified index word, and returns it to the user as the search result. From the returned summary data, the user grasps the outline of the retrieved document data and judges whether it satisfies his or her search condition.
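The lookup performed by the search unit 30 might look like the following sketch. Everything named here is an assumption for illustration: the summary group storage unit 20 is modeled as a plain dictionary from a document identifier to its registered phrases, the list of hit document IDs is assumed to come from a separate document search step, and matching is done on lowercased whitespace tokens rather than on morphologically extracted index words.

```python
def summaries_for_hits(summary_store, hit_doc_ids, index_word):
    """For each retrieved document, return the registered summary phrases
    that contain the specified index word (hypothetical helper)."""
    needle = index_word.lower()
    results = {}
    for doc_id in hit_doc_ids:
        matching = [
            phrase
            for phrase in summary_store.get(doc_id, [])
            if needle in phrase.lower().split()
        ]
        if matching:
            results[doc_id] = matching
    return results


# Hypothetical contents of the summary store.
store = {
    "doc-1": ["Printer jams often", "Replace the toner"],
    "doc-2": ["Toner is low"],
}
hits = summaries_for_hits(store, ["doc-1", "doc-2"], "toner")
```

A document that matches the query but has no registered phrase containing the index word is simply absent from the result, which is the case the registration procedure tries to make rare.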

Each functional block may be implemented in hardware or in software. When implemented in software, the CPU of the document search device reads a program stored in memory and executes the summary data registration process.

Next, the overall operation flow of summary data registration in the registration unit 10 is described with reference to the flowchart in FIG. 4.

First, the phrase/index word table creation unit 12 divides the document data collected by the document data collection unit 11 into phrases, extracts the index words of each phrase by morphological analysis, and creates the phrase/index word table (S101). The registered phrase extraction unit 13 then determines, from the created table, whether the total number of phrases M obtained by the division is equal to or greater than a predetermined reference value N (for example, N = 160) (S102). If M is equal to or less than N, the registered phrase extraction unit 13 registers all of the phrases obtained by dividing the document data in the summary group storage unit 20 as summary data, in association with the identification information of that document data (S103).

If, on the other hand, M is equal to or greater than N, the process moves to the phase of selecting, from the phrase group, the phrases to register as summary data. The registered phrase extraction unit 13 first resets to 0 both a counter I, which holds the phrase ID of the vector Vi to be ORed with the result vector V, and a counter C, which holds the number of phrases extracted so far as registration target phrases (S104). It then takes the logical OR of each element of the vector Vi of the phrase whose phrase ID equals I with the corresponding element of the result vector (S105). The result of this OR is the new result vector Vnew, and the result of the previous OR is the result vector Vold.

The registered phrase extraction unit 13 then determines whether Vnew and Vold are identical (S106). In other words, it determines whether the phrase of vector Vi contains an index word not contained in any phrase already selected as a registration target phrase.

If Vnew and Vold are not identical, the phrase of vector Vi contains an index word not contained in any phrase already selected as a registration target phrase. The phrase is therefore made a registration target phrase, counter C is incremented by one, and Vold is replaced with Vnew (S107). The registered phrase extraction unit 13 then determines whether counter C equals the reference value N (S108).

If counter C equals N, the number of registration target phrases has reached the reference value N. The registered phrase extraction unit 13 therefore stops selecting registration target phrases at this point and registers all phrases selected so far in the summary group storage unit 20 as summary data, in association with the identification information of the document data (S109).

If counter C does not equal N, the number of registration target phrases has not yet reached N. To determine whether every phrase obtained by the division has been ORed with the result vector, the registered phrase extraction unit 13 checks whether counter I equals M-1, that is, whether the last phrase has been reached (S110).

If counter I does not equal M-1, there remain phrases that have not yet been ORed with the result vector, that is, phrases not yet checked for index words missing from the already selected registration target phrases. Counter I is therefore incremented by one (S111), and the processing from S105 is repeated for the phrase whose phrase ID equals the incremented value of counter I.

If counter I equals M-1, every phrase obtained by dividing the document data has been ORed, so all phrases selected as registration target phrases at this point are registered in the summary group storage unit 20 as summary data, in association with the identification information of the document data (S112). Because the number of registration target phrases has not yet reached N, the phrases that were not selected are then taken in ascending order of phrase ID until counter C reaches N, and these are also registered in the summary group storage unit 20 as summary data (S113).
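The selection flow of steps S102 to S113 can be condensed into a short function. This is a sketch under stated assumptions rather than the flowchart of FIG. 4 itself: the function name is hypothetical, the result vector is represented by a Python set of already covered index words (equivalent to the binary vector), and the case M = N, which the text assigns to both branches, is treated here as "register everything".

```python
def select_summary_phrases(phrase_words, n):
    """phrase_words: list of index-word sets, list position = phrase ID.
    Returns the phrase IDs to register, at most n of them."""
    m = len(phrase_words)
    if m <= n:                                 # S102 -> S103 (M = N assumed here)
        return list(range(m))

    covered = set()                            # plays the role of the result vector
    targets = []                               # len(targets) is counter C
    for i, words in enumerate(phrase_words):   # counter I: S105, S110, S111
        if not words <= covered:               # S106: Vnew differs from Vold
            covered |= words                   # S107: Vold replaced with Vnew
            targets.append(i)
            if len(targets) == n:              # S108
                return targets                 # S109

    # S112/S113: fill the remaining slots with unselected phrases,
    # in ascending order of phrase ID.
    chosen = set(targets)
    for i in range(m):
        if len(targets) >= n:
            break
        if i not in chosen:
            targets.append(i)
    return sorted(targets)


doc = [{"a", "b"}, {"b"}, {"c"}, {"a"}, {"d"}]
```

For the five-phrase `doc` above, a limit of 3 selects phrase IDs 0, 2 and 4, which together cover every index word; a limit of 4 additionally fills the free slot with phrase ID 1.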

According to this embodiment, the registration unit 10 preferentially selects, from the phrase group obtained by dividing the collected document data, phrases that include index words not included in other phrases, and registers them in the summary group storage unit 20. This makes it possible to provide a document search device that limits the size and number of summary data items stored per document data, and so keeps down the capacity of the summary group storage unit 20, while preferentially registering summary data that covers as many of the index words in the document data as possible. It thereby suppresses the situation in which document data is related to the index word specified in the search condition, yet the summary data registered in the summary group storage unit 20 does not contain that index word, so that summary data for the document data cannot be returned to the user.

In this embodiment, phrase IDs are assigned in order of appearance in the document data, and the phrase/index word table is created accordingly. Alternatively, all of the document data may first be divided into phrases, and phrase IDs may then be assigned starting from the phrase that contains the most index words before the phrase/index word table is created. Creating the table in this way allows phrases containing more index words to be preferentially registered in the summary group storage unit 20 as summary data. Consequently, when a user searches with many index words specified as the search condition, it becomes more likely that a phrase containing many of those index words can be returned as summary data.
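The alternative ID assignment can be illustrated with a short sketch. The function name `order_by_index_word_count` is hypothetical, and the tie-breaking rule (phrases with the same number of index words keep their order of appearance) is an assumption, since the description does not specify one.

```python
from typing import List, Set


def order_by_index_word_count(phrase_terms: List[Set[str]]) -> List[int]:
    """Return original phrase positions in the order of their new phrase IDs.

    Position k of the result is the phrase that receives phrase ID k:
    phrases with more index words come first. Python's sorted() is stable,
    so ties keep their order of appearance (an assumed tie-breaking rule).
    """
    return sorted(range(len(phrase_terms)),
                  key=lambda i: -len(phrase_terms[i]))
```

Given phrases with index words `{"a"}`, `{"a", "b", "c"}` and `{"b", "c"}`, the second phrase would receive phrase ID 0, the third phrase ID 1, and the first phrase ID 2, so a selection pass in ascending ID order reaches the phrases richest in index words first.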

A diagram showing the functional blocks of the document search device in this embodiment.
A diagram showing the detailed functional blocks of the registration unit in the document search device of this embodiment.
A diagram showing an example of the phrase/index word table created in the document search device of this embodiment.
A flowchart showing the process of registering summary data in the summary group storage unit in the document search device of this embodiment.

Explanation of symbols

10 registration unit, 11 document data collection unit, 12 phrase/index word table creation unit, 13 registration phrase extraction unit, 20 summary group storage unit, 30 search unit.

Claims

A summary registration device that registers summary data of new search-target document data in a summary storage device, the summary storage device storing summary data of document data to be returned as a search result obtained by searching a search-target document data group for document data related to a designated index word, the summary registration device comprising:
dividing means for dividing the new search-target document data into predetermined units;
index word extraction means for extracting index words from each piece of divided data included in the divided data group obtained by the division; and
registration means for preferentially selecting, from the divided data group, divided data that includes an index word not overlapping the index words included in other divided data, and registering the selected divided data in the summary storage device as summary data of the new document data.

The summary registration device according to claim 1,
wherein the registration means preferentially selects, from the divided data group, divided data that includes an index word not overlapping the index words included in other divided data and that includes more index words than the other divided data, and registers the selected divided data in the summary storage device as summary data of the new document data.

A summary registration method for registering summary data of new search-target document data in a summary storage device that stores summary data of document data to be returned as a search result obtained by searching a search-target document data group for document data related to a designated index word, the method comprising:
a division step of dividing the new search-target document data into predetermined units;
an index word extraction step of extracting index words from each piece of divided data included in the divided data group obtained by the division; and
a registration step of preferentially selecting, from the divided data group, divided data that includes an index word not overlapping the index words included in other divided data, and registering the selected divided data in the summary storage device as summary data of the new document data.

The summary registration method according to claim 3,
wherein, in the registration step, divided data that includes an index word not overlapping the index words included in other divided data and that includes more index words than the other divided data is preferentially selected from the divided data group and registered in the summary storage device as summary data of the new document data.

A program for causing a computer to execute processing for registering summary data of new search-target document data in a summary storage device that stores summary data of document data to be returned as a search result obtained by searching a search-target document data group for document data related to a designated index word, the program causing the computer to execute:
a division step of dividing the new search-target document data into predetermined units;
an index word extraction step of extracting index words from each piece of divided data included in the divided data group obtained by the division; and
a registration step of preferentially selecting, from the divided data group, divided data that includes an index word not overlapping the index words included in other divided data, and registering the selected divided data in the summary storage device as summary data of the new document data.

The program according to claim 5,
wherein, in the registration step, divided data that includes an index word not overlapping the index words included in other divided data and that includes more index words than the other divided data is preferentially selected from the divided data group and registered in the summary storage device as summary data of the new document data.