JP2011103075A - Method for extracting excerpt sentence

Info

Publication number: JP2011103075A
Application number: JP2009258087A
Authority: JP
Inventors: Toshio Ikeda; 利夫池田
Original assignee: Kansai Electric Power Co Inc
Current assignee: Kansai Electric Power Co Inc
Priority date: 2009-11-11
Filing date: 2009-11-11
Publication date: 2011-05-26

Abstract

PROBLEM TO BE SOLVED: To accurately extract, as an excerpt sentence, the principal part of a document file corresponding to a search query.

SOLUTION: A network system S includes: a database 10; a search engine 20 for searching the database 10 by using a predetermined search algorithm; a terminal apparatus 30 to be used by a user; and a data processing apparatus 40 for preparing a document index for document search. The data processing apparatus 40 divides the document data of one document file into a plurality of document blocks, evaluates, in each document block, the proximity of the unit sentences that include a word of interest, the number of keywords included, and so on, specifies the document block most appropriate for extracting an excerpt sentence as an important document block, and extracts a sentence including a keyword from the important document block as the excerpt sentence.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an excerpt sentence extraction method that, when a search is performed on document files in a database, extracts part of the text contained in a document file hit by the search as an excerpt sentence and displays it as a search result.

Search systems that let a user find a desired document file in a database, whether on the worldwide Internet or on a local network such as an in-house network, are in wide use. In such a search system, when, for example, a keyword search is executed and the list of hit documents is shown on the display screen of a terminal, an excerpt from each document is usually displayed alongside.

In an Internet search through a portal site such as Yahoo (registered trademark) or Google (registered trademark), if an HTML summary tag written by the creator of the document file exists, the text of that summary tag is displayed as the excerpt. If no HTML summary tag exists, or if summary tags cannot be specified at all, as on an in-house local portal site, part of the document data in the document file is extracted according to a predetermined rule and displayed as the excerpt (see, for example, Patent Document 1). Common extraction rules include identifying the line in which the search keyword appears first (or in the middle, or last) in the document data and extracting several lines before and after that line, and simply extracting the first several lines of the document data without identifying any line that contains the search keyword.

Patent Document 1: JP 2008-176685 A

The excerpt serves as a clue when the user who ran the search looks for documents that match his or her search needs. However, mechanical excerpt extraction methods such as those above are not very likely to extract the passage most closely related to the search keyword. For example, the text around the line in which the search keyword first appears is not necessarily the essential part of the document, and the main part often lies later in the document. In that case, a document containing the content the user wants may be hit and yet be overlooked by the user.

An object of the present invention is to provide an excerpt sentence extraction method that, in a document search process on a database, can accurately extract the main part of a document file corresponding to the search query as an excerpt sentence.

An excerpt sentence extraction method according to one aspect of the present invention that achieves the above object is a method of extracting an excerpt sentence in a case where a search of document files in a database is performed with a predetermined query and, as the search result, part of the text contained in a document file matching the query is extracted and displayed as an excerpt sentence. The method comprises: a step of determining one word of interest from among the words contained in the document data of the document file; a step of treating the document data of one document file as being partitioned into a plurality of unit sentences, each containing a predetermined group of characters, and defining, as a sentence block within the document data, a region of continuous text that contains a plurality of unit sentences containing the word of interest; a step of evaluating, in each sentence block, the appearance interval of those unit sentences within the continuous text, and identifying an important sentence block using the shortness of the appearance interval as an index; and a step of extracting, from the important sentence block, a sentence containing the word of interest as the excerpt sentence (claim 1).

With this configuration, for example, a word corresponding to the search keyword is set as the word of interest, the unit sentences containing that word are identified, and each region of the document data that contains a plurality of such unit sentences is identified as one sentence block. In each sentence block, the appearance interval of the unit sentences containing the word of interest is evaluated, and a block with a short appearance interval is identified as the important sentence block. This amounts to locating the passage of the document data in which the word of interest appears in concentration; such a passage is presumed to contain a detailed discussion related to the word of interest and is therefore likely to be suitable as an excerpt. Because a sentence containing the word of interest is extracted from such an important sentence block, the excerpt sentence can be extracted with high accuracy.

In the above configuration, the step of defining sentence blocks preferably includes: identifying, among the plurality of unit sentences, a first unit sentence in which the word of interest appears for the first time in the document data, a second unit sentence in which it appears for the second time, and a third unit sentence in which it appears for the third time, and defining the text region from the first unit sentence to the third unit sentence as a first sentence block; and identifying a fourth unit sentence in which the word of interest appears for the fourth time in the document data, and defining the text region from the second unit sentence to the fourth unit sentence as a second sentence block (claim 2).

With this configuration, a sentence block (the first sentence block) is identified from a set of three unit sentences, and the next sentence block (the second sentence block) is identified with the starting unit sentence shifted by one. Because each sentence block contains three unit sentences, the appearance interval of the unit sentences is easy to evaluate, and because sentence blocks are defined one after another while the starting unit sentence is shifted one at a time, the processing can be kept simple.
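The sliding definition of sentence blocks and the selection by shortest appearance interval can be illustrated with a minimal sketch. It assumes that a unit sentence is one line, that a line contains the word of interest when a plain substring match succeeds, and that the appearance interval of a block is measured as the line span between its first and last hit. The function names `keyword_lines`, `sentence_blocks` and `important_block` are hypothetical and do not appear in the specification.

```python
from typing import List, Optional, Tuple


def keyword_lines(lines: List[str], word: str) -> List[int]:
    """Indices of the unit sentences (here: lines) that contain the word of interest."""
    return [i for i, line in enumerate(lines) if word in line]


def sentence_blocks(hits: List[int], size: int = 3) -> List[Tuple[int, int]]:
    """Slide a window over `size` consecutive hits, shifting the start by one hit.

    Each block is returned as (first_line, last_line), both inclusive.
    """
    return [(hits[k], hits[k + size - 1]) for k in range(len(hits) - size + 1)]


def important_block(blocks: List[Tuple[int, int]]) -> Optional[Tuple[int, int]]:
    """Pick the block whose hits are packed most tightly (assumed metric: line span)."""
    if not blocks:
        return None
    return min(blocks, key=lambda b: b[1] - b[0])


doc = [
    "The turbine is mentioned once here.",   # line 0
    "Unrelated text.",                       # line 1
    "More unrelated text.",                  # line 2
    "Still unrelated.",                      # line 3
    "Details about the turbine begin.",      # line 4
    "The turbine works as follows.",         # line 5
    "Therefore the turbine is useful.",      # line 6
    "Closing remarks.",                      # line 7
]
hits = keyword_lines(doc, "turbine")   # lines 0, 4, 5 and 6 contain the word
blocks = sentence_blocks(hits)         # first block: hits 1-3, second block: hits 2-4
best = important_block(blocks)         # the tightly packed second block
```

In this example the first block spans lines 0 to 5 and the second spans lines 4 to 6, so the second block is selected even though the word first appears on line 0.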

In the above configuration, the unit sentence is preferably one line of text, and the appearance interval of the unit sentences is preferably evaluated by the distance between lines (claim 3). With this configuration, the document data can easily be partitioned into unit sentences by using tags or the like that indicate the breaks between lines.

In the step of identifying the important sentence block, the frequency of appearance of the word of interest is preferably also evaluated, and the important sentence block is preferably identified using a high frequency of appearance as an index (claim 4). With this configuration, the frequency of appearance of the word of interest is also evaluated as a condition for identifying the important sentence block, so the excerpt sentence can be extracted with still higher accuracy.
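One way the frequency of appearance could be combined with the appearance interval is sketched below. The text here does not state how the two indices are weighed against each other; the sketch assumes, purely for illustration, that the line span is compared first and that the occurrence count breaks ties. The names `block_score` and `pick_block` are hypothetical.

```python
from typing import List, Tuple


def block_score(lines: List[str], block: Tuple[int, int], word: str) -> Tuple[int, int]:
    """Return (line_span, -occurrences); a smaller tuple means a better block.

    Assumption: the span is the primary index and the occurrence count is a tie-breaker.
    """
    first, last = block
    span = last - first
    occurrences = sum(line.count(word) for line in lines[first:last + 1])
    return (span, -occurrences)


def pick_block(lines: List[str], blocks: List[Tuple[int, int]], word: str) -> Tuple[int, int]:
    """Choose the block with the shortest span, preferring more occurrences on a tie."""
    return min(blocks, key=lambda b: block_score(lines, b, word))


doc = [
    "pump starts",                 # line 0: 1 occurrence
    "the pump runs",               # line 1: 1 occurrence
    "pump stops",                  # line 2: 1 occurrence
    "other matters",               # line 3
    "pump pump data",              # line 4: 2 occurrences
    "pump curve and pump head",    # line 5: 2 occurrences
    "pump limits",                 # line 6: 1 occurrence
]
candidates = [(0, 2), (4, 6)]      # two blocks with the same line span
chosen = pick_block(doc, candidates, "pump")
```

Both candidate blocks span two lines, so the block covering lines 4 to 6, which contains the word five times rather than three, is chosen.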

In the above configuration, in the step of identifying the important sentence block, the frequency of appearance of noun words is preferably also evaluated, and the important sentence block is preferably identified using, as an index, a high frequency of appearance of noun words in the unit sentences containing the word of interest (claim 5). In general, noun words can be said to convey information more readily than words that are adjectives, adverbs, verbs, and the like. Using a high frequency of noun words as a condition for identifying the important sentence block therefore further improves the accuracy of excerpt extraction.

According to the present invention, in a document search process on a database, the main part of a document file corresponding to the search query can be accurately extracted as an excerpt sentence. The invention can therefore contribute to providing a search system in which the user can accurately obtain the necessary information from the search hit list.

FIG. 1 is a block diagram schematically showing the hardware configuration of a network system to which the excerpt sentence extraction method according to the present invention is applied.
FIG. 2 is a schematic flowchart outlining the search system according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of a search result display screen.
FIG. 4 is a schematic diagram showing a comparative example of excerpt extraction.
FIG. 5 is a functional block diagram showing the configuration of the search system.
FIG. 6 is a schematic diagram for explaining the creation of a search index.
FIG. 7 is an explanatory diagram showing the excerpt extraction process.
FIG. 8 is an explanatory diagram showing the excerpt extraction process.
FIG. 9 is an explanatory diagram showing the excerpt extraction process.
FIG. 10 is a table showing an example of a document index.
FIG. 11 is a flowchart showing the document search process.
FIG. 12 is a flowchart showing the excerpt extraction process.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram schematically showing the hardware configuration of a network system S to which the excerpt sentence extraction method of the present invention is applied. In the network system S, a database 10 available on a communication network, a search engine 20 that searches the database 10 using a predetermined search algorithm, a terminal device 30 used by a user, and a data processing device 40 that mainly creates the document index for document search are connected via the Internet IN or a local network LN so that they can exchange data.

The database 10 stores a large number of document files to be searched. When the database 10 is connected to the Internet IN, it is a collection of many websites, each with its own domain name. An example of such a database 10 is one accessible through a portal site such as Yahoo (registered trademark) or Google (registered trademark). When the database 10 is connected to a local network LN built inside a company or similar organization, it is a database that stores the shared document files of that company.

The search engine 20 is equipped with a search algorithm having predetermined parameters and searches the database 10 for document files corresponding to a given query. Specifically, the search engine 20 analyzes the query text to create a search index, and reads the document index for search, which has been created by extracting keywords and attributes (metadata) from each search-target document file stored in the database 10. The search engine 20 then uses the search algorithm to collate the document index with the search index and extracts the document files with a high degree of matching (those that fit the query).

The terminal device 30 comprises communication terminals 31, 32, 33, 34, ... such as personal computers, mobile phones, and personal digital assistants, each owned by one of many users. The terminal device 30 can access the search engine 20 and the database 10 via the Internet IN or the local network LN. The terminal 31, for example, includes a keyboard 311 for entering the keywords or sentences (the query) that serve as search conditions, a display 312 for showing the search screen, the list of hit documents, document contents, and the like, and a mouse 313 for entering commands (such as a document selection instruction or a selection end instruction) on the screen shown on the display 312.

Each user gives a query to the search engine 20 through the keyboard 311 of his or her communication terminal 31, 32, 33, 34, ... and obtains a list of the document files hit by the search for that query. The user can then select a desired document file from the list and display its contents on his or her own display 312.

The data processing device 40 performs document analysis on each search-target document file stored in the database 10, extracts keywords and attributes, and creates the document index for search. In addition, in the search process of this embodiment, part of the text contained in a document file matching the query is extracted as an excerpt sentence and also shown on the display 312; the data processing device 40 performs the process of extracting a passage suitable as this excerpt sentence in association with a word of the document index (the word of interest). The specific configuration of the data processing device 40 is described in detail later.

FIG. 2 is a schematic flowchart outlining the document file search process executed by the network system S. As processing on the search source side, when a query is given from the terminal device 30, the search engine 20 performs document analysis, including for example morphological analysis, on the query text (step S1). Next, a search index is created from the words obtained by the document analysis (step S2). This search index consists mainly of a group of keywords closely related to the subject of the question. The search index is then used to search the database 10, which contains a large amount of data, for documents (step S3).

As processing on the search target side, the data processing device 40 periodically performs document analysis for extracting metadata (step S01) and creation of the document index (step S02). The results obtained by the search in step S3 are ranked according to their degree of matching under the search algorithm and output as a list (ranking display) to the terminal device 30 of the user who entered the query (step S4).

FIG. 3 shows an example of the search result screen shown on the display 312 in step S4. The example is an Internet search with the keyword "Kansai Electric Power" in which each hit document file has an HTML summary tag written by its creator. As shown in FIG. 3, an excerpt sentence (summary) outlining the content of the document is displayed below the title of each hit document. This excerpt sentence is the content of the HTML summary tag.

On the other hand, when a document file on the Internet has no such HTML summary tag, or when summary tags cannot be specified, as on an in-house local portal site, part of the document data in the document file is extracted according to a predetermined rule and displayed as the excerpt sentence. By referring to the excerpt sentence, the user can grasp the outline of the content of a hit document file without having to open each file and check its contents. It is therefore important that the excerpt sentence be in line with the query. If the excerpt sentence is inaccurate, the user may skip over a document file even though it contains the description the user wants.

FIG. 4 schematically shows a conventional, common excerpt extraction rule for a document file that has no summary tag. In this example, the search keyword "A" is given from the terminal device 30 to the search engine 20, and a hit document containing the keyword "A" is extracted as the result of the search. The extraction rule identifies the line L1 in which the search keyword "A" first appears in the document data of the hit document and extracts several lines before and after the line L1 as the excerpt sentence D1.
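The conventional rule of FIG. 4 can be written as a short sketch. It assumes that the document data is already split into lines and that the number of context lines on each side is a free parameter; the function name `first_hit_excerpt` and the parameter `context` are hypothetical.

```python
from typing import List


def first_hit_excerpt(lines: List[str], keyword: str, context: int = 1) -> List[str]:
    """Return the first line containing the keyword plus `context` lines on each side."""
    for i, line in enumerate(lines):
        if keyword in line:
            start = max(0, i - context)
            return lines[start:i + context + 1]
    return []  # the keyword does not occur in the document


doc = [
    "Introduction.",
    "The boiler is listed among the equipment.",   # first hit (corresponds to line L1)
    "Miscellaneous notes.",
    "Further notes.",
    "The boiler operates at high pressure.",
    "The boiler is inspected every year.",
]
excerpt = first_hit_excerpt(doc, "boiler")
```

Here the excerpt consists of the first three lines, so the later lines that discuss the keyword in detail are never reached, which is the shortcoming this comparative example is meant to show.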

However, the essential description of the keyword "A" in the hit document is not necessarily located near the line L1 in which the keyword first appears. FIG. 4 shows an example in which the keyword "A" appears frequently in the region indicated by reference sign D2, where the keyword is described in detail. The description in the region D2 is most likely the appropriate excerpt, but the extraction rule described above cannot extract the region D2 as the excerpt sentence.

In this embodiment, therefore, text suitable as an excerpt sentence is defined by the following (1) and (2), and processing is performed so that such text is extracted.
(1) There is continuity between the sentences (in topic).
(2) The keyword is discussed at length.
A factor for judging point (1) mechanically is that the sentences containing the keyword are located close to one another. A factor for point (2) is that the text contains the keyword many times.

To extract an excerpt sentence that matches this definition, this embodiment, in outline, divides the document data of one document file into a plurality of document blocks and evaluates, in each document block, the proximity of the unit sentences containing the keyword (the word of interest), the number of times the keyword is contained, and so on. The document block that best matches the above definition is identified as the important document block, and a sentence containing the keyword is extracted from the important document block as the excerpt sentence. In this embodiment, this processing is executed by the data processing device 40; its details are given in the following description of the functional configuration of the network system S.

The excerpt sentence extraction method according to this embodiment is described in detail below with reference to FIGS. 5 to 12. FIG. 5 is a functional block diagram showing the functional configuration of the network system S according to this embodiment. The database 10 includes a plurality of document servers 11, 12, 13, ... (Web servers or in-house local servers) in which various document files are stored. The search engine 20 functionally includes a document index storage unit 21, a search index creation unit 22, a search processing unit 23, and a ranking display processing unit 24. The terminal device 30 includes a query input unit 301, a display unit 302, and an operation unit 303. The data processing device 40 includes a document extraction unit 41, a document analysis unit 42, a weight calculation unit 43, a storage unit 44, and an excerpt sentence extraction unit 45.

The document index storage unit 21 stores the index (document index) used to search the document files contained in the database 10. The document index is created by the data processing device 40 at a predetermined timing (for example, once a day), and the stored content is updated.

The search index creation unit 22 analyzes the query (keywords or a question text) given from the terminal device 30 and creates a search index. For example, when the query is a question text, the unit extracts the independent words in the text and weights them with reference to, for example, their frequency of appearance, thereby creating the search index. FIG. 6 is a schematic diagram showing an example of a search index table. In this example, independent words A, B, C, ... are extracted from the query, and each of the words A, B, C, ... is given a weight value (Si) derived by a predetermined formula.

The search processing unit 23 searches the database 10 for document files using a search algorithm. Specifically, it collates the search index with the document index and extracts the document files with a high similarity to the search index (the query). A preset search algorithm is used for this search. The search algorithm is selected from, for example, the vector space model (cosine measure), the Dice coefficient, the Jaccard coefficient, the T-score, mutual information, and the Simpson coefficient. Each of these search algorithms has its own formula, in which various parameters are set.
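Four of the similarity measures named above can be sketched in their standard textbook forms. The specification does not give the formulas or parameters actually used, so the following is only an illustration under two assumptions: the cosine measure is computed over word-to-weight mappings, and the Dice, Jaccard and Simpson (overlap) coefficients are computed over plain sets of index words. All function and variable names are hypothetical.

```python
import math
from typing import Dict, Set


def cosine(q: Dict[str, float], d: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a or b else 0.0


def simpson(a: Set[str], b: Set[str]) -> float:
    """Simpson (overlap) coefficient: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


# Hypothetical search index (query) and document index entries.
query_terms = {"power", "grid"}
doc_terms = {"grid", "outage", "report"}
query_weights = {term: 1.0 for term in query_terms}
doc_weights = {term: 1.0 for term in doc_terms}

scores = {
    "cosine": cosine(query_weights, doc_weights),
    "dice": dice(query_terms, doc_terms),
    "jaccard": jaccard(query_terms, doc_terms),
    "simpson": simpson(query_terms, doc_terms),
}
```

The two sets share one word out of two and three respectively, so the Simpson coefficient is the most generous of the set-based measures and the Jaccard coefficient the strictest.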

The ranking display processing unit 24 creates a list in which the document files hit by the search of the search processing unit 23 are ranked in descending order of similarity (degree of matching) to the query. The list can be browsed on the terminal device 30 and is in practice shown on the display unit 302 after the search is complete. In this display, the excerpt sentence described above is shown together with the title of each document file.

The query input unit 301 of the terminal device 30 is the part that accepts a query, such as keywords for the search, from the user; it is, for example, the keyboard 311 shown in FIG. 1.

The display unit 302 is, for example, the display 312 shown in FIG. 1, and displays the browsing screen (query input screen) of the search engine 20, the list of hit document files together with their excerpt sentences, the contents of a document file selected from the list, and the like.

The operation unit 303 is, for example, the mouse 313 shown in FIG. 1, and gives operation instructions to selection areas, links, the task bar, and the like on the image shown on the display unit 302. Through the operation unit 303, the user gives a selection instruction, which selects one entry in the list of hit document files shown on the display unit 302 and displays its contents, and a selection end instruction, which stops that display.

The data processing device 40 creates the document index used to search the document files contained in the database 10. As the trigger for document index creation, the document extraction unit 41 of the data processing device 40 extracts a plurality of document files from the database 10 at predetermined intervals.

The document analysis unit 42 performs document analysis, such as normalization, document structure analysis, and synonym processing, on the document data contained in each document file extracted by the document extraction unit 41, and divides the document data into words. Normalization removes characters, symbols, and the like that are unnecessary for analysis from the document to be analyzed and unifies full-width and half-width characters, so that document structure analysis can be performed properly. Document structure analysis applies, to each normalized target document, morphological analysis, which divides the document into words, and syntactic analysis, which identifies the dependencies between words (for example, the relationship between a noun and a verb). Synonym processing uses a thesaurus (synonym dictionary) that absorbs synonyms and spelling variants so that terms expressed differently are treated as a single word. This document analysis yields the independent words (keywords). A basic index (an index to which weights have not yet been given), which forms the basis of the document index, is created from the words derived here.
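The normalization and synonym steps can be illustrated with a minimal sketch. It assumes that Unicode NFKC normalization is an acceptable way to unify full-width and half-width characters, that control and format characters are the "unnecessary" characters to be removed, and that the thesaurus is a simple mapping from a variant to its canonical word. The function names and the thesaurus entry are hypothetical; morphological and syntactic analysis are outside the scope of the sketch.

```python
import re
import unicodedata
from typing import Dict, List


def normalize(text: str) -> str:
    """Unify full-width/half-width forms and drop characters that hinder analysis."""
    # NFKC folds full-width Latin letters, digits and the ideographic space
    # into their ordinary half-width counterparts.
    text = unicodedata.normalize("NFKC", text)
    # Remove control and format characters (Unicode categories C*),
    # but keep newlines and tabs so that line structure survives.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text).strip()


def unify_synonyms(words: List[str], thesaurus: Dict[str, str]) -> List[str]:
    """Replace each word by its canonical form when the thesaurus lists one."""
    return [thesaurus.get(word, word) for word in words]


cleaned = normalize("Ｙａｈｏｏ\u3000２０")          # full-width letters, space and digits
thesaurus = {"KEPCO": "Kansai Electric Power"}      # hypothetical synonym entry
unified = unify_synonyms(["KEPCO", "grid"], thesaurus)
```

Applying normalization before word division means that the same term written in full-width and half-width forms ends up as a single index word.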

The weight calculation unit 43 assigns a weight to each word listed in the base index under predetermined conditions, which completes the document index. The most commonly used basis for this weighting is word frequency: words that appear more often in the document data receive higher weights.
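A frequency-based weighting of this kind might look like the sketch below. The text only says that more frequent words receive higher weights; using the relative frequency as the weight is an assumption, and `frequency_weights` is a hypothetical name.

```python
from collections import Counter


def frequency_weights(words):
    """Weight each word by its share of all word occurrences in the document."""
    counts = Counter(words)
    total = sum(counts.values())
    # Assumption: weight = relative frequency; the patent leaves the formula open.
    return {word: count / total for word, count in counts.items()}
```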

The storage unit 44 is a memory that temporarily stores the base index created by the document analysis unit 42 and the weight data calculated by the weight calculation unit 43, and that also serves as a work area for temporarily holding the working data of the excerpt sentence extraction unit 45 described below. The formulas for calculating weights, inter-line distances, and so on are also stored in the storage unit 44.

For each word extracted for the document index, the excerpt sentence extraction unit 45 extracts from the document data an excerpt sentence containing that word. The excerpt sentence is the text displayed as the gist of the document when that word is used as a search keyword. A single document file normally yields several document index words, and an excerpt sentence is extracted for each of them. Functionally, the excerpt sentence extraction unit 45 comprises an attention word setting unit 451, a block division unit 452, an inter-line distance calculation unit 453, a block selection unit 454, and an excerpt line selection unit 455.

For each document file, the attention word setting unit 451 takes the words of the base index extracted by the document analysis unit 42 one at a time. Specifically, it numbers the document files to be analyzed, numbers the words of each document file, and sets each word in turn as the attention word for which an excerpt sentence is to be extracted.

The block division unit 452 treats the document data of one document file as a sequence of unit sentences, each containing a predetermined group of characters, and defines as a sentence block a region of the text that contains a plurality of unit sentences in which the attention word appears. The preferred unit sentence is one line of text, since the unit sentences can then be delimited easily using line-break tags or the like. Alternatively, a unit sentence may be defined as a fixed number of characters (for example, 50 characters) or a fixed number of bits. A sentence block may contain any plural number of unit sentences that include the attention word, but three is preferred. The block division unit 452 defines several such sentence blocks within the document data of one document file.
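The two ways of delimiting unit sentences mentioned above (one line each, or a fixed number of characters) can be sketched as follows. The function name and the `chars_per_unit` parameter are hypothetical, and line breaks are assumed to be plain newline characters rather than markup tags.

```python
def unit_sentences(text, chars_per_unit=None):
    """Split document text into unit sentences.

    With chars_per_unit=None each line is one unit sentence; otherwise the
    text is cut into chunks of chars_per_unit characters (e.g. 50).
    """
    if chars_per_unit is None:
        return text.splitlines()
    return [
        text[start:start + chars_per_unit]
        for start in range(0, len(text), chars_per_unit)
    ]
```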

A preferred example of the processing of the block division unit 452 is described with reference to FIGS. 7 and 8. As shown in FIG. 7, suppose that the attention word setting unit 451 has set the word "A" as the attention word for the document data X of one document file, and that, among lines 1 to 16 of the text in document data X (the unit sentences), the word A appears on lines 2, 8, 9, 11, and 14.

In this case, as shown in FIG. 8, the block division unit 452 first identifies the first unit sentence, in which the word A appears for the first time (here, line 2), the second unit sentence, in which it appears for the second time (line 8), and the third unit sentence, in which it appears for the third time (line 9), and defines the region from the first unit sentence to the third unit sentence as the first sentence block B1. That is, the first sentence block is identified by taking three unit sentences containing the word A as one set.

Next, taking the second unit sentence (line 8) as its starting point, the block division unit 452 identifies the third unit sentence (line 9) and the fourth unit sentence, in which the word A appears for the fourth time (line 11), and defines the region from the second unit sentence to the fourth unit sentence as the second sentence block B2. That is, the second sentence block is identified by shifting the starting unit sentence down by one occurrence. In the same way, taking the third unit sentence (line 9) as the starting point, it identifies the fourth unit sentence (line 11) and the fifth unit sentence, in which the word A appears for the fifth time (line 14), and defines the region from the third unit sentence to the fifth unit sentence as the third sentence block B3. The same block-setting process continues for as long as document data X continues. As a result, a plurality of sentence blocks covering different regions, each containing three unit sentences that include the word A, are defined for document data X. Because the blocks are defined one after another simply by shifting the starting unit sentence one at a time, the processing of the block division unit 452 can be kept simple.
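The shifting-window definition of blocks B1, B2, B3 can be sketched as below. It is an illustration under stated assumptions: `define_blocks` is a hypothetical name, a unit sentence is one line, and a plain substring test stands in for matching against the analyzed words.

```python
def define_blocks(lines, word, per_block=3):
    """Return each sentence block as the tuple of 1-based line numbers
    of the unit sentences containing `word` that delimit the block."""
    # Lines in which the attention word appears, numbered from 1.
    hits = [n for n, line in enumerate(lines, start=1) if word in line]
    # Slide a window of per_block occurrences, shifting the start by one each time.
    return [
        tuple(hits[i:i + per_block])
        for i in range(len(hits) - per_block + 1)
    ]


# Word A on lines 2, 8, 9, 11 and 14 of a 16-line document, as in the example.
document = ["A" if n in {2, 8, 9, 11, 14} else "x" for n in range(1, 17)]
blocks = define_blocks(document, "A")
# blocks == [(2, 8, 9), (8, 9, 11), (9, 11, 14)], i.e. B1, B2 and B3
```

If the word appears in fewer than three lines the function returns an empty list, which corresponds to the case where no block can be formed.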

For each sentence block B1, B2, B3, ..., the inter-line distance calculation unit 453 derives an index value that evaluates how short the line spacing is (that is, how closely together the unit sentences occur), based on the inter-line distances Sn between the unit sentences in which the word A appears. Examples of such an index value are the reciprocal of the product of the inter-line distances and the reciprocal of the vector size.

FIG. 9(A) shows the first sentence block B1 of FIG. 8. For B1 there are two inter-line distances Sn: S1, between the first unit sentence (line 2) and the second unit sentence (line 8), and S2, between the second unit sentence and the third unit sentence (line 9). Here S1 = 6 (lines) and S2 = 1 (line). From these values, the reciprocal of the product of the inter-line distances (F) is obtained by equation (1). The larger the value of equation (1), the shorter the overall distance between the lines.
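Equation (1) itself is not reproduced in this text. Reading the name of the index literally (the reciprocal of the product of the inter-line distances), it presumably has the form F = 1 / (S1 × S2 × ...). The sketch below computes that assumed form; the function names are hypothetical.

```python
from math import prod


def line_gaps(hit_lines):
    """Inter-line distances Sn between consecutive lines containing the word."""
    return [later - earlier for earlier, later in zip(hit_lines, hit_lines[1:])]


def inverse_gap_product(hit_lines):
    """Assumed form of equation (1): F = 1 / (S1 * S2 * ...)."""
    return 1 / prod(line_gaps(hit_lines))
```

For block B1 (lines 2, 8, 9) the gaps are 6 and 1, giving F = 1/6; for B2 (lines 8, 9, 11) the gaps are 1 and 2, giving F = 0.5, so B2 scores as the more tightly spaced block.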

The method using the reciprocal of the vector size (J) treats the inter-line distances Sn as the elements of a vector for each sentence block. In the example of FIG. 9(A), the first sentence block B1 is represented by the two-dimensional vector of S1 and S2 shown in FIG. 9(B). Since S1 = 6 (lines) and S2 = 1 (line), the vector is b = (6, 1), and the reciprocal of its size is
J = 1/|b| = 1/√(6² + 1²) = 0.1644
The larger this value, the shorter the overall distance between the lines. For the second sentence block B2, for example, S1 = 1 (line) and S2 = 2 (lines), so the reciprocal of the vector size is
J = 1/|b| = 1/√(1² + 2²) = 0.4472
which is larger, so B2 is evaluated as having a shorter overall inter-line distance than B1. In general, the reciprocal of the vector size (J) is obtained by equation (2).
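Equation (2) is not reproduced in this text, but the two worked values suggest the general form J = 1/√(S1² + S2² + ...). A minimal sketch of that assumed form, with a hypothetical function name:

```python
from math import sqrt


def inverse_vector_size(gaps):
    """Assumed form of equation (2): reciprocal of the Euclidean norm of the gaps."""
    return 1 / sqrt(sum(s * s for s in gaps))


j_b1 = inverse_vector_size([6, 1])  # block B1: S1 = 6, S2 = 1
j_b2 = inverse_vector_size([1, 2])  # block B2: S1 = 1, S2 = 2
```

Rounded to four places, `j_b1` and `j_b2` reproduce the values 0.1644 and 0.4472 given in the text.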

The indices above evaluate how short the overall line spacing is between the unit sentences in which the word A appears. In addition, the inter-line distance calculation unit 453 derives an index value for excluding sentence blocks that contain an extremely long inter-line distance Sn. An example is the reciprocal of the inter-line standard deviation (G), which is based on the standard deviation of the inter-line distances Sn and is obtained by equation (3). For example, when the distances vary widely, as in the first sentence block B1 with S1 = 6 (lines) and S2 = 1 (line), the standard deviation is large, so its reciprocal (G) is small.
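Equation (3) is not reproduced in this text. The sketch below assumes G = 1/σ, where σ is the population standard deviation of the inter-line distances. Two details are assumptions of this sketch rather than statements from the patent: the choice of population (not sample) standard deviation, and the hypothetical `floor` parameter that avoids division by zero when all distances are equal.

```python
from statistics import pstdev


def inverse_gap_stdev(gaps, floor=0.5):
    """Assumed form of equation (3): reciprocal of the standard deviation of Sn.

    `floor` is a hypothetical lower bound on the deviation so that evenly
    spaced blocks (deviation 0) still get a finite score.
    """
    return 1 / max(pstdev(gaps), floor)
```

Under these assumptions B1 (gaps 6 and 1) has a deviation of 2.5 and G = 0.4, while B2 (gaps 1 and 2) has a deviation of 0.5 and G = 2.0, so the block with the uneven spacing is penalized.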

The block selection unit 454 selects, from the plurality of sentence blocks, the important sentence block, that is, the block containing the text best suited to serve as the excerpt sentence. To do so it refers to the reciprocal of the vector size (J) or the reciprocal of the product of the inter-line distances (F), to the reciprocal of the inter-line standard deviation (G), and to additional determination index values. Examples of the additional values are the attention word density (H), which indicates how frequently the attention word appears in the sentence block, and the noun density (K), which indicates how frequently nouns appear in the sentence block.

The attention word density (H) is obtained as
H = H2 / H1 ... (4)
where H1 is the total number of words in one sentence block (the number of words in that block extracted by the document analysis unit 42) and H2 is the number of times the attention word appears in that block. The more frequently the attention word appears, the higher H becomes. In general, a higher frequency of the attention word indicates that the text describes that word in more detail, so the block is more suitable as an excerpt sentence. For example, when several sentence blocks have the same reciprocal of the vector size (J), the block with the larger attention word density (H), that is, the one in which the attention word appears several times within a line, is judged to be the important sentence block.

The noun density (K) is obtained as
K = K2 / K1 ... (5)
where K1 is the total number of words in one sentence block and K2 is the number of nouns appearing in that block. In general, nouns convey information more readily than adjectives, adverbs, verbs, and the like. A sentence block whose text contains many nouns therefore carries more information and is more suitable as an excerpt sentence, so the frequency of nouns can serve as one index for identifying the important sentence block.
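Equations (4) and (5) are simple ratios and can be sketched as below. The input formats are assumptions: a block is taken to be a non-empty list of already analyzed words, and part-of-speech tags are assumed to come from the morphological analysis as `(word, tag)` pairs with the hypothetical tag `"NOUN"`.

```python
def attention_word_density(block_words, attention_word):
    """Equation (4): H = H2 / H1 for one sentence block."""
    h1 = len(block_words)                  # total words in the block
    h2 = block_words.count(attention_word) # occurrences of the attention word
    return h2 / h1


def noun_density(tagged_words):
    """Equation (5): K = K2 / K1 for one sentence block."""
    k1 = len(tagged_words)                                  # total words
    k2 = sum(1 for _, tag in tagged_words if tag == "NOUN")  # nouns only
    return k2 / k1
```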

The block selection unit 454 calculates the block importance L by equation (6), which multiplies the values of equations (1), (3), (4), and (5):
L = F × G × H × K ... (6)
Alternatively, it calculates the block importance L by equation (7), which multiplies the values of equations (2), (3), (4), and (5):
L = J × G × H × K ... (7)
Here F is the reciprocal of the product of the inter-line distances, J is the reciprocal of the vector size, G is the reciprocal of the inter-line standard deviation, H is the attention word density, and K is the noun density. Once the block importance L has been calculated for every sentence block, the block selection unit 454 identifies the block with the highest L as the important sentence block.
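The scoring and selection step can be sketched as follows. The function names are hypothetical, and the component values passed in are illustrative numbers rather than data from the patent.

```python
def block_importance(proximity, g, h, k):
    """Equations (6)/(7): L = proximity * G * H * K,
    where proximity is F for equation (6) or J for equation (7)."""
    return proximity * g * h * k


def select_important_block(scored_blocks):
    """Return the label of the block with the highest importance L.

    scored_blocks is a list of (label, L) pairs.
    """
    return max(scored_blocks, key=lambda pair: pair[1])[0]


# Illustrative values only.
scores = [
    ("B1", block_importance(0.1644, 0.4, 0.5, 0.5)),
    ("B2", block_importance(0.4472, 2.0, 0.5, 0.5)),
]
best = select_important_block(scores)  # "B2"
```

Because L is a product, a block that scores zero on any one index (for example, no nouns at all) can never be selected over a block with all indices positive.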

From the text of the important sentence block identified by the block selection unit 454, the excerpt line selection unit 455 extracts suitable text containing the attention word as the excerpt sentence to be displayed when that attention word is used as a search keyword. The extracted text may be, for example, the entire text of the important sentence block, or only the lines in that block that contain the attention word. The extracted excerpt sentence is stored in the storage unit 44 in association with the document index word (attention word) of the document file, and is finally incorporated into the document index table and stored in the document index storage unit 21 of the search engine 20.

FIG. 10 is a table showing an example of the document index table. For each document number, which identifies a document file, the index words are listed (words A and B are shown here), and a weight (Di) is assigned to each word. For word A, the excerpt sentence and the line numbers to which it belongs are also recorded. Although not shown, the same applies to other words and other document files. Thus, for example, when word A is given as a search keyword and the document with document number 1002 in FIG. 10 is hit, the text on lines 0002, 0008, and 0009 of its document data is displayed as the excerpt sentence for document number 1002.

For some document files, an excerpt sentence cannot be extracted by the above procedure, for example when only one or two unit sentences containing the attention word appear in the document data. When only one such unit sentence exists, the excerpt line selection unit 455 extracts, for example, that unit sentence together with the unit sentences immediately before and after it as the excerpt sentence. When only two exist, S1 is the only inter-line distance Sn that can be obtained; in this case it is desirable to substitute, for example, a predetermined fixed value for S2 so that the above calculations can still be applied.
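The two fallback cases can be sketched as below. The function names are hypothetical, and the `default_gap` value of 10 is an arbitrary placeholder, since the text only says that S2 is replaced by a predetermined fixed value.

```python
def excerpt_for_single_hit(hit_line, total_lines):
    """One occurrence only: take the line plus its neighbours, clipped to the document."""
    return [n for n in (hit_line - 1, hit_line, hit_line + 1) if 1 <= n <= total_lines]


def gaps_with_default(hit_lines, default_gap=10):
    """Two occurrences only: S1 is measured, S2 is filled with a fixed value."""
    gaps = [later - earlier for earlier, later in zip(hit_lines, hit_lines[1:])]
    if len(gaps) == 1:
        gaps.append(default_gap)  # Assumption: placeholder for the fixed S2
    return gaps
```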

Next, the search operation in the network system S according to this embodiment is described with reference to the flowchart of FIG. 11. When a query for a document file search is entered at the terminal device 30 (step S1), the query is received by the search engine 20 via the Internet IN or the local network LN (step S2). The search index creation unit 22 then analyzes the query to extract words, obtains a weight value (Si) for each word using a predetermined arithmetic expression, and creates a search index such as the one illustrated in FIG. 6 (step S3).

Next, the search algorithm of the search processing unit 23 calculates how well each document file matches the query, using the search index created in step S3 and the document index stored in advance in the document index storage unit 21 (step S4). Document files with a high degree of match are then identified (step S5). The search processing unit 23 further identifies the most relevant word, that is, the word in the document index that is most closely related to the query (step S6). To find it, the weight value (Si) of each word in the search index is multiplied by the weight value (Di) of the same word in the document index (see FIG. 10), and the word with the highest product Si × Di is chosen.
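Step S6 can be sketched as a maximum over the products Si × Di. The dictionaries and weights below are illustrative, the function name is hypothetical, and restricting the comparison to words present in both indexes is an assumption of this sketch.

```python
def most_relevant_word(query_weights, doc_weights):
    """Return the word with the highest Si * Di, or None if no word is shared.

    query_weights maps query words to Si; doc_weights maps index words to Di.
    """
    shared = query_weights.keys() & doc_weights.keys()
    if not shared:
        return None
    return max(shared, key=lambda w: query_weights[w] * doc_weights[w])


# Illustrative weights: A gives 0.6 * 0.5 = 0.30, B gives 0.4 * 0.9 = 0.36.
word = most_relevant_word({"A": 0.6, "B": 0.4}, {"A": 0.5, "B": 0.9, "C": 1.0})
```

Here `word` is "B": the query weights A more heavily, but B dominates once the document-side weight is taken into account.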

The excerpt sentence stored in association with the most relevant word is then retrieved (step S7). Using the examples of FIGS. 6 and 10, if the product Si × Di for word A is the highest in the document file with document number 1002, the excerpt sentence for word A is chosen as the excerpt sentence for that document file.

The ranking display processing unit 24 then creates a list that ranks the document files hit by the search in descending order of their degree of match with the query (step S8). The list contains the title of each document file and the excerpt sentence retrieved in step S7. The search engine 20 sends this search result list to the terminal device 30 at which the query was entered (step S9).

Next, the excerpt sentence extraction process performed by the data processing device 40 is described with reference to the flowchart of FIG. 12. The excerpt sentence extraction unit 45 sets a counter p, which corresponds to the numbers assigned to the document files to be processed, to 0 (step S11) and then advances it to p = p + 1 (step S12). The p-th document file is then extracted (step S13); on the first pass this is the document file numbered 1.

The attention word setting unit 451 then sets the attention word. A counter q, which corresponds to the numbers assigned to the document index words extracted for the p-th document file, is set to 0 (step S14) and then advanced to q = q + 1 (step S15). The q-th word is thereby taken as the attention word, and the block division unit 452 divides the document data into a plurality of sentence blocks for that word (step S16; see FIGS. 7 and 8).

Next, a counter i, which corresponds to the numbers assigned to the sentence blocks, is set to 0 (step S17) and then advanced to i = i + 1 (step S18). For the i-th sentence block, the inter-line distance calculation unit 453 calculates the reciprocal of the product of the inter-line distances or the reciprocal of the vector size, together with the reciprocal of the inter-line standard deviation; the block selection unit 454 calculates the attention word density and the noun density; and the importance L of the i-th sentence block is obtained by equation (6) or (7) (step S19). The value of L is temporarily stored in the storage unit 44 (step S20).

It is then checked whether the sentence block just evaluated is the last one (step S21). If it is not (NO in step S21), the process returns to step S18 and the importance L of the next sentence block is calculated. If it is the last one (YES in step S21), the block selection unit 454 selects the important sentence block with the highest importance L, and the excerpt line selection unit 455 identifies the excerpt sentence (lines) from that block (step S22). This excerpt sentence is stored in the document index storage unit 21.

Next, it is checked whether the word counter q has reached its final value (step S23). If not (NO in step S23), the process returns to step S15, and the division into sentence blocks and the calculation of block importance L are repeated for the next attention word. If q has reached its final value (YES in step S23), it is checked whether the document file counter p has reached its final value (step S24). If not (NO in step S24), the process returns to step S12 and the same processing is repeated for the next document file. If p has reached its final value (YES in step S24), the process ends.
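The nested loops of FIG. 12 (documents, then index words, then sentence blocks) can be sketched as follows. This is a simplified illustration, not the patent's implementation: all names are hypothetical, matching is by substring, the score uses only J and G (the densities H and K are omitted for brevity), and the 0.5 floor on the standard deviation is an assumption that avoids division by zero.

```python
from math import sqrt
from statistics import pstdev


def excerpt_lines(lines, word, per_block=3):
    """Steps S16 to S22: score every block for one word, return the best block's lines."""
    hits = [n for n, line in enumerate(lines, start=1) if word in line]
    best, best_score = None, -1.0
    for i in range(len(hits) - per_block + 1):        # S17, S18, S21: loop over blocks
        block = hits[i:i + per_block]
        gaps = [b - a for a, b in zip(block, block[1:])]
        j = 1 / sqrt(sum(s * s for s in gaps))        # reciprocal of vector size
        g = 1 / max(pstdev(gaps), 0.5)                # assumed floor on the deviation
        score = j * g                                 # S19: simplified importance
        if score > best_score:
            best, best_score = block, score
    return best                                       # None if fewer than 3 hits


def build_excerpt_index(documents, index_words):
    """Steps S11 to S24: outer loops over document files (p) and index words (q)."""
    table = {}
    for doc_id, lines in documents.items():           # S12, S13, S24
        for word in index_words[doc_id]:              # S15, S23
            table[(doc_id, word)] = excerpt_lines(lines, word)
    return table


doc = ["A" if n in {2, 8, 9, 11, 14} else "x" for n in range(1, 17)]
index = build_excerpt_index({1002: doc}, {1002: ["A"]})
```

With this simplified score, the block on lines 8, 9 and 11 wins for word A, because its gaps (1 and 2) are both short and nearly even. A full implementation would also multiply in H and K, which could change the ranking.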

With the network system S according to this embodiment, the key part of a document file corresponding to the search query can be extracted accurately as an excerpt sentence when document files in the database 10 are searched. This makes it possible to provide a search system from which the user can reliably obtain the necessary information from the search hit list.

DESCRIPTION OF SYMBOLS
S Network system
10 Database
20 Search engine
21 Document index storage unit
22 Search index creation unit
23 Search processing unit
24 Ranking display processing unit
30 Terminal device
40 Data processing device
45 Excerpt sentence extraction unit
451 Attention word setting unit
452 Block division unit
453 Inter-line distance calculation unit
454 Block selection unit
455 Excerpt line selection unit

Claims

1. An excerpt sentence extraction method for extracting, when document files in a database are searched with a predetermined query, a part of the text contained in a document file matching the query as an excerpt sentence and displaying it as a search result, the method comprising:
determining one attention word from the group of words contained in the document data of the document file;
treating the document data of one document file as being divided into a plurality of unit sentences each containing a predetermined group of characters, and defining, as sentence blocks in the document data, regions of the text each containing a plurality of unit sentences that include the attention word;
evaluating, for each of the sentence blocks, the appearance interval of the unit sentences in the text, and identifying an important sentence block using the shortness of the appearance interval as an index; and
extracting a sentence including the attention word from the important sentence block as the excerpt sentence.

2. The excerpt sentence extraction method according to claim 1, wherein the step of defining the sentence blocks comprises:
identifying, among the plurality of unit sentences, a first unit sentence in which the attention word appears first in the document data, a second unit sentence in which the attention word appears second, and a third unit sentence in which the attention word appears third, and defining the region from the first unit sentence to the third unit sentence as a first sentence block; and
identifying a fourth unit sentence in which the attention word appears fourth in the document data, and defining the region from the second unit sentence to the fourth unit sentence as a second sentence block.

3. The excerpt sentence extraction method according to claim 1, wherein the unit sentence is one line of text, and
the appearance interval of the unit sentences is evaluated by the distance between lines.

4. The excerpt sentence extraction method according to any one of claims 1 to 3, wherein, in the step of identifying the important sentence block,
the appearance frequency of the attention word is taken as an evaluation target, and the important sentence block is identified using that appearance frequency as an index.

5. The excerpt sentence extraction method, wherein, in the step of identifying the important sentence block,
the appearance frequency of nouns is taken as an evaluation target, and the important sentence block is identified using the appearance frequency of nouns in the unit sentences including the attention word as an index.