JP2012003697A

JP2012003697A - Snippet generation method, snippet generation device, snippet generation program and recording medium

Info

Publication number: JP2012003697A
Application number: JP2010140649A
Authority: JP
Inventors: Atsuyuki Goto; 淳之後藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2010-06-21
Filing date: 2010-06-21
Publication date: 2012-01-05

Abstract

PROBLEM TO BE SOLVED: To provide a method for dynamically creating a snippet depending on a given text.
SOLUTION: In a snippet generation device 11, a query analysis unit 14 performs morphological analysis on a given text to break it down into morphemes, and converts a pattern that takes the inflected forms of each morpheme into account into a regular expression; an important sentence selection unit 16 runs a pattern search over the target document, measures the frequency of occurrence of the regular expression, calculates a score for each sentence of the document based on the in-document frequency and the document frequency of the regular expression, and selects the top n sentences in descending order of score; and a sentence contraction unit 17 deletes unnecessary portions from the n sentences to fit a specified number of characters, thereby generating a snippet.

Description

The present invention relates to a snippet generation device that generates, from a text, a summary sentence representing an outline of that text.

As conventional snippet generation, Internet search services such as Google, Yahoo!, and Microsoft's Bing are known to present search results in which each website is accompanied by a snippet outlining that site.
First, Google search basically generates a snippet from the meta description tag of the HTML document of the website being searched (query-independent).
Regarding Google search, Patent Document 1 discloses a method for displaying the source of a fact: receiving a fact query that includes one or more terms; identifying a response to the fact query that includes one or more terms; identifying a source document that includes one or more terms of the fact query and one or more terms of the response; generating at least one snippet of the source document, the snippet including one or more terms of the fact query and one or more terms of the response; and generating a reply that includes the snippet.

Yahoo! search, on the other hand, uses a hybrid method that combines query-dependent and query-independent snippet generation.
Specifically, Yahoo! search works as follows:
1. Each sentence of the document is scored taking into account the frequency of occurrence of the words or proper nouns that make up the document and the position of the sentence (query-independent snippet generation); this score is F_I.
2. As in the snippet generation method of the present invention, each sentence of the document is scored taking into account the frequency of occurrence of the search keywords (query-dependent); this score is F_Q.
3. The query sentence is analyzed to quantify the query intent as a value α between 0 and 1 inclusive, and each sentence of the document is scored as αF_I + (1−α)F_Q to generate the snippet.
Although it is not explicitly stated, quantifying the query intent in step 3 requires external data and knowledge.
Unlike the Google method, the snippet generation method of Yahoo! in the US does not require the searched document to provide a snippet generation hint (the meta description tag).
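The weighted combination in step 3 above can be illustrated with a minimal sketch. The function name `hybrid_score` and the sample values are hypothetical; only the blend αF_I + (1−α)F_Q comes from the text, and how α is derived from the query is not described here.

```python
def hybrid_score(f_i: float, f_q: float, alpha: float) -> float:
    """Blend a query-independent score f_i with a query-dependent score f_q.

    alpha is the quantified query intent and must lie in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * f_i + (1.0 - alpha) * f_q


# With alpha = 0.25 the query-dependent score dominates the blend.
example = hybrid_score(f_i=2.0, f_q=4.0, alpha=0.25)
```

At α = 1 the sentence score is purely query-independent, and at α = 0 it is purely query-dependent.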

Microsoft's Bing also adopts a hybrid method, although its specific implementation differs.
As a related document, the snippet generation method of Yahoo! in the US (Patent Document 2) has been published.
In the query-dependent snippet generation of the US Yahoo! method, each sentence of the target text is measured and checked for:
a. what percentage of the search terms it contains,
b. how many search terms it contains, and
c. whether it contains the query sentence as a substring,
and the sentence is scored by comprehensively evaluating the scores for a to c.
Non-Patent Document 1 discloses extracting keywords by using the word occurrence costs of a morpheme dictionary and taking the words with a high word occurrence cost.
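The three checks a to c described above for the US Yahoo! method might be computed per sentence roughly as follows. This is only a sketch: the function name is hypothetical, counting distinct terms for check b is an assumption, and the way the three values are combined into one score is not given in the text, so they are returned separately.

```python
def query_dependent_features(sentence: str, terms: list[str], query: str) -> dict:
    """Compute checks a-c for one sentence (plain substring matching)."""
    found = [t for t in terms if t in sentence]
    return {
        # a. share of the search terms that occur in the sentence
        "fraction_of_terms": len(found) / len(terms) if terms else 0.0,
        # b. number of distinct search terms that occur (assumption)
        "term_count": len(found),
        # c. whether the whole query occurs as a substring
        "contains_query": query in sentence,
    }


features = query_dependent_features(
    "security pact signed in 1960",
    ["security", "pact", "Japan"],
    "security pact",
)
```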

Patent Document 3 discloses a document search device that searches the full text of documents stored in a document database for documents matching a search condition. For every document in the database, the device converts the full text into its reading and generates a full-text index. It accepts a search condition, extracts words from it by morphological analysis, and generates an initial search condition expression from the stem of each word's reading and the search condition expression. Using the morphological analysis results for the extracted words, it also generates a refined search condition expression from the inflected forms of each word's reading and the search condition expression. The search is executed against the document database by referring to the full-text index and applying either expression: when given the initial search condition expression, the device performs a primary search using the full-text index and creates an intermediate result; when given the refined search condition expression, it performs a full-text search over the documents in the intermediate result.

Patent Document 4 discloses a question answering device. From a question sentence entered by the user, it extracts the target term and the semantic role, relative to that term, of the explanation being sought. Using the target term as a search key, it searches the Internet for web pages closely related to the term and retrieves them. From the retrieved pages it extracts, as explanation parts about the target term, four kinds of patterns: patterns in which a link destination is attached to the term, patterns in which the term is made a heading by a definition-list format, a bold tag, or the like, patterns in which the term is explained in a sentence, and patterns in which the term is explained in an adnominal clause. It determines the semantic role of each extracted explanation part with respect to the target term, generates an answer sentence from the explanation parts whose semantic role matches the one extracted from the question, and responds to the question. It evaluates the confidence value of each explanation by normalizing the inner product of two vectors: a vector characterizing the whole set of explanations, built from the number of explanations in which each word appearing in the extracted explanation sets occurs, and a vector characterizing each explanation, built from a function that determines whether each word appears in that explanation.

Conventional snippet generation methods generate a summary of a document based on the frequency of occurrence of content words such as nouns and noun phrases.
However, conventional snippet generation methods have the problem that a snippet (or a hint) must be prepared in advance on the side of the document being searched.
It is therefore strongly desired to generate snippets without preparing them in advance on the side of the document being searched.
An object of the present invention is to provide a method for dynamically generating a snippet depending on a given text.

To solve the above problem, the invention according to claim 1 is a method that, when a document related to a text is input, extracts the parts of the document related to the text and generates a summary of a specified number of characters, the method comprising: a sentence analysis step of morphologically analyzing the given text to decompose it into morphemes and converting a pattern that takes the inflected forms of each morpheme into account into a regular expression; an important sentence selection step of performing a pattern search on the target document, measuring the frequency of occurrence of the regular expression, calculating a score for each sentence of the document based on the in-document frequency and the document frequency of the regular expression, and selecting the top n sentences in descending order of score; and a sentence contraction step of deleting unnecessary parts from the n sentences to reach the specified number of characters, thereby producing a snippet.
The invention according to claim 2 is the snippet generation method according to claim 1, wherein the important sentence selection step comprises: calculating the weight of each search term based on the in-document frequency and the document frequency of the regular expression; calculating the weight of each sentence of the document using the search term weights; and selecting sentences from the target document in descending order of weight until the specified number of characters is exceeded.

The invention according to claim 3 is the snippet generation method according to claim 1, wherein the sentence contraction step includes adjusting the snippet length to the specified number of characters by repeatedly deleting one character at a time from the least important part.
The invention according to claim 4 is the snippet generation method according to claim 1, wherein only a language dictionary is used as the dictionary in the morphological analysis of the sentence analysis step.
The invention according to claim 5 is a device that, when a document related to a text is given, extracts the parts of the document related to the text and generates a summary of a specified number of characters, the device comprising: sentence analysis means for morphologically analyzing the given text to decompose it into morphemes and converting a pattern that takes the inflected forms of each morpheme into account into a regular expression; important sentence selection means for performing a pattern search on the target document, measuring the frequency of occurrence of the regular expression, calculating a score for each sentence of the document based on the in-document frequency and the document frequency of the regular expression, and selecting the top n sentences in descending order of score; and sentence contraction means for deleting unnecessary parts from the n sentences to reach the specified number of characters, thereby producing a snippet.

The invention according to claim 6 is a computer program that, when a document related to a text is input, extracts the parts of the document related to the text and generates a summary of a specified number of characters, the program causing a computer to execute: a sentence analysis step of morphologically analyzing the given text to decompose it into morphemes and converting a pattern that takes the inflected forms of each morpheme into account into a regular expression; an important sentence selection step of performing a pattern search on the target document, measuring the frequency of occurrence of the regular expression, calculating a score for each sentence of the document based on the in-document frequency and the document frequency of the regular expression, and selecting the top n sentences in descending order of score; and a sentence contraction step of deleting unnecessary parts from the n sentences to reach the specified number of characters, thereby producing a snippet.

The invention according to claim 7 is a recording medium on which is recorded a computer program that, when a document related to a text is input, extracts the parts of the document related to the text and generates a summary of a specified number of characters, the recorded snippet generation program causing a computer to process: a sentence analysis step of morphologically analyzing the given text to decompose it into morphemes and converting a pattern that takes the inflected forms of each morpheme into account into a regular expression; an important sentence selection step of performing a pattern search on the target document, measuring the frequency of occurrence of the regular expression, calculating a score for each sentence of the document based on the in-document frequency and the document frequency of the regular expression, and selecting the top n sentences in descending order of score; and a sentence contraction step of deleting unnecessary parts from the n sentences to reach the specified number of characters, thereby producing a snippet.

According to the snippet generation method of the present invention, turning the search terms in the query sentence into a regular expression makes it possible to select important sentences in a single scan of the document while also taking the inflected forms of the search terms into account, so that a snippet can be generated dynamically depending on the given text. There is no need to prepare a snippet or a hint in advance on the side of the document being searched.

FIG. 1: Block diagram illustrating the configuration of the snippet generation device 11 according to an embodiment of the present invention.
FIG. 2: Main flowchart outlining the operation of the snippet generation device 11 according to the embodiment of the present invention.
FIG. 3: Subroutine flowchart (part 1) illustrating the important sentence selection processing in the snippet generation device 11 according to the embodiment of the present invention.
FIG. 4: Subroutine flowchart (part 2) illustrating the important sentence selection processing in the snippet generation device 11 according to the embodiment of the present invention.
FIG. 5: Subroutine flowchart illustrating the sentence contraction processing in the snippet generation device 11 according to the embodiment of the present invention.

A configuration example of a snippet generation device according to an embodiment of the invention is described below with reference to the drawings.
FIG. 1 illustrates the configuration of a snippet generation device 11 according to an embodiment of the present invention.
The snippet generation device 11 is connected to an input/output device 12. The input/output device 12 receives a query sentence from outside and passes it to the query analysis unit 14.
The snippet generation block 13 includes a query analysis unit 14 and a snippet generation unit 15; the snippet generation unit 15 contains an important sentence selection unit 16 and a sentence contraction unit 17. The query analysis unit 14 decomposes the query sentence into morphemes by morphological analysis and extracts the character strings that are valid as search terms. The query analysis unit 14 is also connected to a search device 18.
The search device 18 manages and operates a document database (hereinafter, document DB) 19, performing search, addition, update, and similar processing. For example, the search device 18 searches the document DB 19 based on the search terms from the query analysis unit 14 and outputs the search results to a search result acquisition unit 20. The document DB 19 stores document data in database form.
The search result acquisition unit 20 outputs the document data corresponding to the search terms, obtained from the document DB 19 via the search device 18, to the snippet generation unit 15.
As described above, the snippet generation unit 15 includes the important sentence selection unit 16 and the sentence contraction unit 17. While receiving document data from the search result acquisition unit 20 (T4), the important sentence selection unit 16 scans the content with the regular expression pattern and measures the frequency of occurrence of the search terms and the sentence boundaries.
The sentence contraction unit 17 receives the selected sentences from the important sentence selection unit 16, deletes characters according to the processing procedure until the specified number of characters is reached, and outputs a list of search results with snippets attached to the input/output device 12.

Next, the basic operation of the snippet generation device 11 according to the embodiment of the present invention is described with reference to FIG. 1. T1 to T6 in FIG. 1 indicate the order of the processing steps.
First, a query sentence is entered into the input/output device 12 (T1) and passed through it to the query analysis unit 14. The query analysis unit 14 decomposes the query sentence into morphemes by morphological analysis and extracts the character strings that are valid as search terms.
The search terms are passed from the query analysis unit 14 to the search device 18 (T2). The search device 18 searches the document DB 19 based on the search terms and outputs the search results to the search result acquisition unit 20 (T3).
Meanwhile, the query analysis unit 14 expresses each search term as a regular expression pattern that takes its inflected forms into account, and passes the union of these patterns to the snippet generation unit 15 as the final regular expression pattern (T5).
While receiving document data from the search result acquisition unit 20 (T4), the important sentence selection unit 16 in the snippet generation unit 15 scans the content with the regular expression pattern and measures the frequency of occurrence of the search terms and the sentence boundaries.
The important sentence selection unit 16 calculates sentence weights according to Formula 1 below, sorts the results in descending order, and selects sentences from the top until the specified number of characters is exceeded. It then passes the selected sentences to the sentence contraction unit 17.
The sentence contraction unit 17 deletes characters down to the specified number of characters according to processing steps B11 to B4 described later, and finally outputs, via the input/output device 12, a list of search results with snippets attached (T6).

The above described the basic operation of the snippet generation device 11 according to the embodiment of the present invention. The operation of the query analysis unit 14, the important sentence selection unit 16, and the sentence contraction unit 17 provided in the snippet generation block 13 shown in FIG. 1 is now described in more detail with reference to FIGS. 2 to 5.
The snippet generation block 13 may be composed of a CPU, a ROM, and a RAM, and the query analysis unit 14, the important sentence selection unit 16, and the sentence contraction unit 17 may each be implemented, for example, as a software module stored in the ROM. In that case, the CPU reads the software modules stored in the ROM and executes them in order.

Next, an overview of the snippet generation processing in the search system is given with reference to the main flowchart shown in FIG. 2.
First, in step S1, the query analysis unit 14 performs query sentence analysis. A textual query sentence is entered into a search box drawn on the computer monitor and is passed to the query analysis unit 14 via the input/output device 12.
The query analysis unit 14 divides the query sentence into morphemes using a morphological analysis program, which uses a language dictionary stored in the document DB 19. For parts of speech that inflect, the morphological analysis program provides information on the inflected form; based on that information and the morpheme string, the pattern of variation of the morpheme string is expressed as a regular expression.

For example, morphological analysis of the English sentence
US, Japan mark 50 years since security pact signed
yields
US/NNP ,/, Japan/NNP mark/VB 50/CD years/NNS since/IN security/NN pact/NN signed/VBN
(Note) NNP: proper noun, NN: common noun, VB: verb, CD: cardinal number, NNS: common noun (plural), IN: preposition, VBN: past participle
and the generated regular expression pattern is
US|Japan|mark|year(s)?|security|pact|sign(ed)?
which is the output of the query sentence analysis.
Only nouns, verbs, adjectives, and adverbs are turned into regular expressions.
The symbol set EOS {'、', '。', '．', '．'} of sentence delimiters is combined by union with the regular expression pattern output by the query sentence analysis to form a new regular expression.
As described above, the morphological analysis uses only a language dictionary, so no external knowledge other than the morpheme dictionary is needed and the method is easy to implement in a system.
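The conversion from tagged morphemes to a regular expression in the example above can be sketched as follows. This is an illustration under stated assumptions rather than the analyzer the patent relies on: the tagged tokens are supplied by hand instead of by a morphological analysis program, and the suffix table `SUFFIXES` and both function names are hypothetical stand-ins for the inflection information such a program would provide.

```python
import re

# Parts of speech that become search terms: nouns, verbs, adjectives, adverbs.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")
# Toy inflection table (assumption): tag -> suffix that may be dropped.
SUFFIXES = {"NNS": "s", "VBN": "ed", "VBD": "ed"}


def term_pattern(token: str, tag: str) -> str:
    """Return a pattern matching the token with or without its suffix."""
    suffix = SUFFIXES.get(tag)
    if suffix and token.endswith(suffix) and len(token) > len(suffix):
        stem = token[: -len(suffix)]
        return f"{re.escape(stem)}({re.escape(suffix)})?"
    return re.escape(token)


def query_pattern(tagged: list[tuple[str, str]]) -> str:
    """Join the per-term patterns of all content words with '|'."""
    parts = [term_pattern(tok, tag) for tok, tag in tagged
             if tag.startswith(CONTENT_PREFIXES)]
    return "|".join(parts)


TAGGED = [("US", "NNP"), (",", ","), ("Japan", "NNP"), ("mark", "VB"),
          ("50", "CD"), ("years", "NNS"), ("since", "IN"),
          ("security", "NN"), ("pact", "NN"), ("signed", "VBN")]
PATTERN = query_pattern(TAGGED)
```

For the tagged sentence above, `PATTERN` reproduces the pattern given in the text. Because that pattern has no word boundaries, a term such as US would also match inside a longer word; whether the original system guards against this is not stated.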

Next, in step S2, the important sentence selection unit 16 performs important sentence selection. First, it measures the frequency of occurrence of the search terms while splitting the text into sentences by regular expression pattern search.
The weight of each search term is calculated from its in-document frequency (TF) and its document frequency (DF); if DF is unknown, a pseudo-DF is used instead. Using the search term weights, the weight of each sentence of the document is calculated, and sentences are selected from the target document in descending order of weight until the specified number of characters is exceeded.
Next, in step S3, the sentence contraction unit 17 performs sentence contraction. If the total number of characters in the selected sentences exceeds the specified number, a calculation that takes into account the number of characters between the search terms in a sentence and the weights of the search terms identifies where to delete one character; this deletion is repeated until the specified number of characters is reached.
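The selection rule of step S2 (take sentences in descending order of weight until the specified number of characters is exceeded) might look like the sketch below. The function name and the sample data are hypothetical, and the reading that the sentence which crosses the limit is still included, leaving the excess to the contraction step S3, is an assumption.

```python
def select_sentences(scored: list[tuple[str, float]], max_chars: int) -> list[str]:
    """Pick sentences by descending weight until max_chars is exceeded."""
    selected: list[str] = []
    total = 0
    for sentence, _weight in sorted(scored, key=lambda sw: sw[1], reverse=True):
        if total > max_chars:
            break  # the limit has been exceeded; contraction trims the rest
        selected.append(sentence)
        total += len(sentence)
    return selected


picked = select_sentences([("aaaa", 1.0), ("bbbbbb", 3.0), ("cc", 2.0)], max_chars=7)
```

In this example the two heaviest sentences total 8 characters, one more than the limit, so selection stops there and the surplus character would be removed during contraction.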

Next, the important sentence selection in step S2 and the sentence contraction in step S3 of FIG. 2 are described in more detail with reference to the subroutine flowcharts of FIGS. 3 to 5.
[Important sentence selection processing]
First, the symbols are defined as follows. Prior to execution of the program shown in the subroutine flowchart of FIG. 3, the following variables and buffers are declared:
T: the text of the target document
p: the regular expression pattern
eos: the sentence delimiter pattern
POS: the search start position for the regular expression pattern
RE: the regular expression pattern search device
sent_j: a container, one per sentence of the document, that stores w_ij
w_ij: the i-th pattern found, stored in container sent_j
s_ij: the start position of w_ij in T
e_ij: the end position of w_ij in T
TF(w_ij): the in-document frequency of w_ij
calcDF(w_ij): a device that calculates the document frequency of w_ij
N: a constant given by the system
In step A1, the target document text T, the regular expression pattern p, and the morpheme dictionary (inverted file) are supplied as input data.

Next, in step A2, the index i, the index j, the search start position POS_i of the i-th regular expression search, and the regular expression pattern p are initialized by the assignments i ← 0, j ← 0, POS_i ← 0, p ← p|EOS.
Next, in step A3, the regular expression pattern search device RE (a virtual device) performs a pattern search on the document text T with the regular expression pattern p, starting from the search start position POS_i. The search result is a character string and its position: the i-th pattern found w_ij, the start position s_ij of w_ij in the document text T, and the end position e_ij of w_ij in T. This is written
{w_ij, s_ij, e_ij} ← RE(T, p, POS_i)
Next, in step A4, it is determined whether the start position s_ij of the i-th pattern found w_ij in the document text T satisfies s_ij ≥ 0. If s_ij < 0, processing jumps to step A10.

Next, in step A5, the i-th searched pattern w_ij is stored in the container sent_j.
sent_j ← {w_ij}
Next, in step A6, the end position e_ij of w_ij in T, incremented by one, is stored as the search start position POS_i of the regular expression pattern.
POS_i ← e_ij + 1
Next, in step A7, i is incremented by one.
i ← i + 1
Next, in step A8, it is determined whether the i-th searched pattern w_ij matches the sentence delimiter pattern EOS. If it matches, the process proceeds to step A9, where j is incremented by one, and then jumps to step A3.
j ← j + 1
If it does not match in step A8, the process jumps directly to step A3.
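The loop of steps A2 to A9 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `scan_sentences` is hypothetical, Python's `re` module stands in for the virtual search device RE, and the delimiter itself is not stored in a container (which matches the contents of sent_j shown in the worked example). Because Python reports an exclusive end offset, the next search starts at `e` rather than `e + 1`.

```python
import re
from typing import List, Tuple

Match = Tuple[str, int, int]  # (matched string, start, exclusive end)


def scan_sentences(text: str, p: str, eos: str) -> List[List[Match]]:
    """Collect query-term matches per sentence in a single pass over text."""
    pattern = re.compile(f"(?:{p})|(?:{eos})")   # step A2: p <- p | EOS
    eos_re = re.compile(eos)
    sentences: List[List[Match]] = [[]]          # sent_0, sent_1, ...
    pos = 0                                      # POS: start of the next search
    while True:
        m = pattern.search(text, pos)            # step A3: RE(T, p, POS)
        if m is None:                            # step A4: nothing more found
            break
        w, s, e = m.group(), m.start(), m.end()
        if eos_re.fullmatch(w):                  # step A8: sentence delimiter
            sentences.append([])                 # step A9: j <- j + 1
        else:
            sentences[-1].append((w, s, e))      # step A5: store w in sent_j
        pos = e if e > s else e + 1              # step A6 (guards empty matches)
    if len(sentences) > 1 and not sentences[-1]:
        sentences.pop()                          # drop the container opened by the final delimiter
    return sentences
```

A sentence without any match still occupies an (empty) container, so sentence indices stay aligned with the document.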

Next, in steps A10 to A13 shown in FIG. 3, Formula 1 is evaluated for each j over all patterns {w_ij} stored in sent_j.

score(sent_j) = Σ_i TF(w_ij) × log(N / calcDF(w_ij)) ... (Formula 1)
Here, i ranges over the word numbers contained in the sentence sent_j. In Formula 1, the in-document frequency TF(w_ij) of each word appearing in sent_j is multiplied by its IDF value, and the sum is taken as the score of the container sent_j.
First, in step A10, initialization is performed by assigning SN ← j and j ← 0.
SN ← j, j ← 0
Next, in step A11, the constant N given by the system is divided by the document frequency calcDF(w_ij) of the i-th searched pattern w_ij, the logarithm of the quotient is taken, the logarithm is multiplied by the in-document frequency TF(w_ij) of w_ij, and the sum Σ of these products is stored as score(sent_j).
If the logarithm were not taken, the value of (total number of documents / document frequency) would become so large that the in-document frequency of a word would hardly be reflected in the result. Taking the natural logarithm keeps this value moderate, so the in-document frequency of a word has a visible effect in Formula 1.
Note that log(total number of documents / document frequency) is a standard measure in computational linguistics, known as IDF (inverse document frequency).
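Formula 1 can be written as a short Python function. This is a sketch under stated assumptions: the name `score_sentence` and its parameters are hypothetical, and each distinct term in a sentence is counted once, which is how the worked example later in this description adds up its sentence scores.

```python
import math
from typing import Callable, Iterable, Mapping


def score_sentence(
    words: Iterable[str],
    tf: Mapping[str, int],
    calc_df: Callable[[str], float],
    n: float,
) -> float:
    """Sum TF(w) * log(N / calcDF(w)) over the distinct terms of one sentence."""
    # Assumption: repeated occurrences inside the sentence are not added twice;
    # TF already carries the count of the term in the whole document.
    return sum(tf[w] * math.log(n / calc_df(w)) for w in set(words))
```

`math.log` is the natural logarithm, in line with the description above.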

Next, in step A12, j is incremented by one.
j ← j + 1
Next, in step A13, j is compared with SN to determine whether
j ≥ SN
holds. If j < SN, the process jumps to step A11.
If j is greater than or equal to SN in step A13, the process ends and control returns to the main routine.
In this way, steps A1 to A13 shown in FIGS. 3 and 4 assign a score to each sentence constituting the document.
Because the search terms in the query sentence are converted into a regular expression, important sentences can be selected in a single scan of the document while also taking the inflected forms of the search terms into account.

In the above description, the implementation of the device calcDF (a virtual device), which obtains the document frequency of the pattern w_ij, depends on the scale of the system and its hardware.
For example, a large-scale search system normally uses an inverted file, so calcDF is implemented by querying the inverted file for the document frequency of the search term.
In contrast, a small-scale search system that does not use an inverted file implements calcDF by returning either the word occurrence cost C (how readily the word occurs) registered in the morpheme dictionary, or 1/C. Both variants exist because whether a higher value of C means that a word occurs more readily or less readily depends on how the dictionary is implemented.
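The two calcDF variants might look like this in Python. Both factory functions and their data layouts (a term-to-document-ID mapping for the inverted file, a term-to-cost mapping for the morpheme dictionary) are assumptions made for illustration; the description does not specify either structure.

```python
from typing import Callable, Dict, Set


def calc_df_from_inverted_file(inverted: Dict[str, Set[int]]) -> Callable[[str], float]:
    """Large-scale variant: document frequency = number of documents listing the term."""
    def calc_df(term: str) -> float:
        return float(len(inverted.get(term, set())))
    return calc_df


def calc_df_from_cost(costs: Dict[str, float], use_reciprocal: bool) -> Callable[[str], float]:
    """Small-scale variant: return the dictionary cost C, or 1/C, as a stand-in."""
    def calc_df(term: str) -> float:
        c = costs[term]
        # Which of C or 1/C grows with word frequency depends on the dictionary.
        return 1.0 / c if use_reciprocal else c
    return calc_df
```

A caller that feeds these values into Formula 1 would need to handle a result of 0, since the formula divides by calcDF.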

[Sentence reduction processing]
The sentence reduction processing of the program shown in the subroutine flowchart will be described with reference to FIG. 5.
First, the symbols are defined as follows.
Before the program shown in FIG. 4 is executed, the following variables and buffers are declared: the selected sentence Sent_j; the i-th searched word w_ij in Sent_j; the specified number of characters m; the score score_j of the selected sentence S_j; the number of characters L_j of the j-th sentence; the total number of characters M of the selected sentences; the weight α_ij of the word w_ij; and the number of characters d_ij,i(j+1) between the words w_ij and w_i(j+1).
In step B1, if the value obtained by subtracting the specified number of characters m from the total number of characters M of the selected sentences is 0 or less, that is, if
(M − m) ≤ 0
the process ends and control returns to the main routine.
If the difference is greater than 0, the process proceeds to step B2.

One character is deleted from a selected sentence S_j by executing steps B2 to B5 below.
First, in step B2, the j with the smallest value of (score_j / L_j) among the selected sentences is searched for. That is, the score score_j of the selected sentence S_j, held in the container sent_j that stores the searched words w_ij, is divided by the number of characters L_j of the j-th sentence, and the j that minimizes this quotient is obtained.
Next, in step B3, assuming that z words were found in the selected sentence S_j, the following value, which represents a distance between w_ij and w_i(j+1), is computed from the number of characters d_ij,i(j+1) between the words w_ij and w_i(j+1) and the word weights α_ij and α_i(j+1):
{(d_ij,i(j+1) × α_ij) / (α_ij + α_i(j+1))}, i = 0, ..., z−1
The i that maximizes this value is obtained.
In the present invention, the distance between w_ij and w_i(j+1) is used as a measure of the degree of influence of the word w_ij.
Next, in step B4, among the characters between the words w_ij and w_i(j+1), the one character located at the distance
(d_ij,i(j+1) × α_ij) / (α_ij + α_i(j+1))
is deleted.
Next, in step B5, the total number of characters M of the selected sentences is decremented by one.
M ← M − 1
The process then jumps to step B1.
As a result, characters are deleted gradually, starting from the portions unrelated to the query sentence.
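Steps B1 to B5 can be sketched in Python as below. The classes `Anchor` and `Sentence` and both function names are hypothetical. The sketch assumes positive weights, sentinel anchors of weight 1.0 at the beginning and end of each sentence (as in the worked example later in this description), that the distance is counted from the end of the left-hand word and rounded up, and that ties go to the earlier gap.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Anchor:
    start: int      # offset of a matched word inside the sentence
    end: int        # exclusive end offset
    weight: float   # alpha of the matched word (assumed positive)


@dataclass
class Sentence:
    text: str
    score: float
    anchors: List[Anchor]  # in text order; first and last are BOS/EOS sentinels


def delete_one_char(sent: Sentence) -> bool:
    """Steps B3-B4: delete one character from the gap with the largest value."""
    best_i, best_val = -1, 0.0
    for i in range(len(sent.anchors) - 1):
        a, b = sent.anchors[i], sent.anchors[i + 1]
        d = b.start - a.end                      # characters between the two words
        val = d * a.weight / (a.weight + b.weight)
        if val > best_val:                       # strict '>' keeps the earlier gap on ties
            best_i, best_val = i, val
    if best_i < 0:
        return False                             # no gap left to shrink
    k = math.ceil(best_val)                      # k-th character after the word, 1-based
    cut = sent.anchors[best_i].end + k - 1
    sent.text = sent.text[:cut] + sent.text[cut + 1:]
    for later in sent.anchors[best_i + 1:]:      # shift everything after the cut
        later.start -= 1
        later.end -= 1
    return True


def reduce_sentences(sents: List[Sentence], m: int) -> None:
    """Steps B1-B5: delete characters until the total length is at most m."""
    total = sum(len(s.text) for s in sents)      # M
    while total > m:                             # step B1
        candidates = [s for s in sents if s.text]
        if not candidates:
            break
        target = min(candidates, key=lambda s: s.score / len(s.text))  # step B2
        if not delete_one_char(target):
            break
        total -= 1                               # step B5: M <- M - 1
```

Because the value is always smaller than the gap width when both weights are positive, the cut position stays inside the gap and never touches a matched word.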

As described above, the sentences are sorted in descending order of score and selected from the top until the specified number of characters is reached, which generates the snippet. If the specified number of characters is exceeded, the selected sentences are reduced.
Furthermore, because characters are deleted from the least important part of the snippet, the snippet the user wants can be generated with the specified number of characters.
As described above, in the snippet generation device 11, the query analysis unit 14 performs morphological analysis on a given sentence to decompose it into morphemes and converts a pattern that takes the inflected forms of each morpheme into account into a regular expression. The important sentence selection unit 16 performs a pattern search on the target document, measures the frequency of occurrence of the regular expression, calculates the score of each sentence constituting the document based on the in-document frequency and the document frequency of the regular expression, and selects the top n sentences in descending order of score. The sentence reduction unit 17 deletes unnecessary portions from the n sentences so that they fit the specified number of characters, producing the snippet. Therefore, when a document related to a sentence is input, the portion related to the sentence can be extracted from the document and a summary with the specified number of characters can be generated.

In addition, the important sentence selection unit 16 calculates the weight of each search term based on the in-document frequency and the document frequency of the regular expression, calculates the weight of each sentence constituting the document using the search term weights, and selects sentences from the target document in descending order of weight until the specified number of characters is exceeded. Important sentences can therefore be selected in a single scan of the document while also taking the inflected forms of the search terms into account.
Because the search terms in the query sentence are converted into a regular expression, important sentences can be selected in a single document scan with inflected forms taken into account, and the snippet can be generated dynamically depending on the given sentence. There is no need to prepare a snippet or hint in advance on the side of the documents to be searched.
In the morphological analysis, only a language dictionary is used as the dictionary. No external knowledge other than the morpheme dictionary is used, so the method is easy to implement in a system.
Furthermore, the snippet length is adjusted to the specified number of characters by repeatedly deleting one character at a time from the least important part, so the snippet the user wants can be generated with the specified number of characters.

<Processing example using an example text>
The following example text is quoted from the Wikipedia article titled "第１次近衛内閣" (First Konoe Cabinet) as of June 4, 2010. For convenience of explanation, the article title has been changed to "第１次近衛内閣と日中戦争" (The First Konoe Cabinet and the Sino-Japanese War).
[Important sentence selection processing]
第１次近衛内閣は、元老・西園寺公望の奏薦を受けて、公爵・貴族院議長の近衛文麿に大命降下され、組閣した内閣である。内閣発足の１ヶ月後に勃発した日中戦争（支那事変）については、当初、不拡大方針を閣議決定して、事態の早期収束を図った。しかし、軍部強硬派の圧力により戦争は拡大。和平交渉にも失敗して、翌１９３８年（昭和１３年）１月には、「爾後国民政府を対手とせず」という、いわゆる「近衛声明」（第一次近衛声明）を発表した。また、同年４月には国家総動員法を制定して戦時体制を整えた。同年１１月には「東亜新秩序建設」を戦争目的と規定する声明（東亜新秩序声明、第二次近衛声明）を発表し、同年１２月には親日派の汪兆銘の重慶脱出を受けて「近衛三原則」（善隣友好、共同防共、経済提携）を日中和平の基本方針として呼びかける声明（第三次近衛声明）を発表した。１９３９年（昭和１４年）１月に、内閣総辞職。
(English translation, for reference: The First Konoe Cabinet was formed by Fumimaro Konoe, a prince and President of the House of Peers, who received the imperial mandate on the recommendation of the elder statesman Kinmochi Saionji. Regarding the Sino-Japanese War (China Incident), which broke out one month after the cabinet was inaugurated, the cabinet initially adopted a non-expansion policy and sought an early settlement of the situation. However, the war expanded under pressure from military hardliners. Peace negotiations also failed, and in January 1938 (Showa 13) the cabinet issued the so-called "Konoe Statement" (First Konoe Statement), declaring that it would "henceforth not deal with the Nationalist Government". In April of the same year it enacted the National Mobilization Law and put a wartime system in place. In November of the same year it issued a statement defining "the construction of a New Order in East Asia" as the purpose of the war (New Order in East Asia Statement, Second Konoe Statement), and in December of the same year, following the escape from Chongqing of the pro-Japanese Wang Zhaoming, it issued a statement (Third Konoe Statement) calling for the "Three Konoe Principles" (good-neighborly friendship, joint defense against communism, and economic cooperation) as the basic policy for Sino-Japanese peace. In January 1939 (Showa 14), the cabinet resigned en masse.)

The example text above is the text T in the software processing.
A regular expression pattern is generated from the query sentence, that is, the article title "第１次近衛内閣と日中戦争", as follows.
Morphological analysis of the query sentence gives the following output (produced by MeCab, a morphological analysis program distributed under the BSD license). Each line shows the surface form followed by its features.
第 接頭詞,数接続,*,*,*,*,第,ダイ,ダイ
１ 名詞,数,*,*,*,*,*
次 名詞,接尾,助数詞,*,*,*,次,ジ,ジ
近衛 名詞,一般,*,*,*,*,近衛,コノエ,コノエ
内閣 名詞,一般,*,*,*,*,内閣,ナイカク,ナイカク
と 助詞,並立助詞,*,*,*,*,と,ト,ト
日 名詞,固有名詞,地域,国,*,*,日,ニチ,ニチ
中 名詞,固有名詞,地域,国,*,*,中,チュウ,チュー
戦争 名詞,サ変接続,*,*,*,*,戦争,センソウ,センソー
Here 接頭詞 denotes a prefix, 名詞 a noun, and 助詞 a particle.
Deleting the unnecessary strings from this output and generating the regular expression pattern gives
p ← 近衛|内閣|日|中|戦争
where 近衛 is "Konoe", 内閣 is "cabinet", 日 is "Japan", 中 is "China", and 戦争 is "war".
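As an illustration of this step, the Python sketch below builds p from an already analysed title. The morpheme tuples are typed in by hand from the MeCab output above, so no MeCab binding is called. The filtering rule (keep nouns, drop numerals and suffixes) is an assumption chosen to reproduce the pattern given in this example; the description only says that unnecessary strings are deleted. The name `build_pattern` is hypothetical.

```python
import re
from typing import List, Tuple

Morpheme = Tuple[str, str, str]  # (surface form, part of speech, subcategory)

TITLE_MORPHEMES: List[Morpheme] = [
    ("第", "接頭詞", "数接続"),
    ("１", "名詞", "数"),
    ("次", "名詞", "接尾"),
    ("近衛", "名詞", "一般"),
    ("内閣", "名詞", "一般"),
    ("と", "助詞", "並立助詞"),
    ("日", "名詞", "固有名詞"),
    ("中", "名詞", "固有名詞"),
    ("戦争", "名詞", "サ変接続"),
]


def build_pattern(morphemes: List[Morpheme]) -> str:
    """Join the content nouns of the query into one alternation pattern."""
    kept: List[str] = []
    for surface, pos, sub in morphemes:
        # Assumed rule: nouns only, excluding numerals (数) and suffixes (接尾).
        if pos == "名詞" and sub not in ("数", "接尾") and surface not in kept:
            kept.append(surface)
    return "|".join(re.escape(s) for s in kept)


p = build_pattern(TITLE_MORPHEMES)
```

Searching the title itself with `p` finds the five terms 近衛, 内閣, 日, 中, and 戦争 in that order. Handling of inflected forms, which matters for verbs and adjectives, is outside this sketch.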

The sentence delimiter pattern EOS is added to p.
EOS ← \n|\r|。|．|.
Note that the period has both a half-width and a full-width form.
p ← p | EOS
POS is set to 0 and passed, together with T and p, to the regular expression pattern search device RE.
RE then returns the following values one after another (following processing steps A4 to A9). Each entry gives the matched string, its start position, and its end position in T.
{"近衛", 3, 5}
{"内閣", 5, 7}
{"近衛", 34, 36}
{"内閣", 50, 52}
{"。", 55, 56}
{"内閣", 56, 58}
{"日", 70, 71}
{"中", 71, 72}
{"戦争", 72, 74}
{"。", 113, 114}
{"戦争", 129, 131}
{"。", 134, 135}
{"近衛", 187, 189}
{"近衛", 196, 198}
{"。", 206, 207}
{"。", 235, 236}
{"戦争", 253, 255}
{"近衛", 277, 279}
{"日", 294, 295}
{"近衛", 310, 312}
{"日", 333, 334}
{"中", 334, 335}
{"近衛", 356, 358}
{"。", 366, 367}
{"内閣", 383, 385}
{"。", 388, 389}

Noting that "。" is the sentence delimiter, when steps A4 to A9 have finished, the containers sent_j hold the following data.
sent₀ = {"近衛", "内閣", "近衛", "内閣"}
sent₁ = {"内閣", "日", "中", "戦争"}
sent₂ = {"戦争"}
sent₃ = {"近衛", "近衛"}
sent₄ = {}
sent₅ = {"戦争", "近衛", "日", "近衛", "日", "中", "近衛"}
sent₆ = {"内閣"}
Calculating the search term weights requires the constant N and calcDF(w), so the following values are assumed.
N ← 100,000
calcDF(近衛) = 80
calcDF(内閣) = 120
calcDF(日) = 9000
calcDF(中) = 4500
calcDF(戦争) = 2600
The weight of each search term extracted from the query sentence is then as follows, where the multipliers 7, 4, 3, 2, and 3 are the numbers of occurrences of each term in T.
α(近衛) = 7 × log(100000 / 80) = 49.92
α(内閣) = 4 × log(100000 / 120) = 26.90
α(日) = 3 × log(100000 / 9000) = 7.22
α(中) = 2 × log(100000 / 4500) = 6.20
α(戦争) = 3 × log(100000 / 2600) = 10.95
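The five weights can be reproduced with a few lines of Python. The dictionaries below simply restate the assumed N and calcDF values and the occurrence counts from the match list; the variable names are illustrative.

```python
import math

N = 100_000
calc_df = {"近衛": 80, "内閣": 120, "日": 9000, "中": 4500, "戦争": 2600}
tf = {"近衛": 7, "内閣": 4, "日": 3, "中": 2, "戦争": 3}  # occurrences in T

# alpha(w) = TF(w) * ln(N / calcDF(w)), rounded to two decimals as in the text
alpha = {w: round(tf[w] * math.log(N / calc_df[w]), 2) for w in tf}
```

With the natural logarithm this yields 49.92, 26.9, 7.22, 6.2, and 10.95, in agreement with the values listed above.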

From these weights, the score of each sentence is
SCORE(sent₀) = α(近衛) + α(内閣) = 76.82
SCORE(sent₁) = α(内閣) + α(日) + α(中) + α(戦争) = 51.27
SCORE(sent₂) = α(戦争) = 10.95
SCORE(sent₃) = α(近衛) = 49.92
SCORE(sent₄) = 0
SCORE(sent₅) = α(戦争) + α(近衛) + α(日) + α(中) = 74.29
SCORE(sent₆) = α(内閣) = 26.90
Sorting the sentences in descending order with the score as the key gives
SCORE(sent₀) = 76.82
SCORE(sent₅) = 74.29
SCORE(sent₁) = 51.27
SCORE(sent₃) = 49.92
SCORE(sent₆) = 26.90
SCORE(sent₂) = 10.95
SCORE(sent₄) = 0
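The scoring and ranking above can be checked with the following Python sketch. It takes the rounded weights as given and, following the sums shown above, counts each distinct term once per sentence; the variable names are illustrative.

```python
alpha = {"近衛": 49.92, "内閣": 26.90, "日": 7.22, "中": 6.20, "戦争": 10.95}

sent = [
    ["近衛", "内閣", "近衛", "内閣"],
    ["内閣", "日", "中", "戦争"],
    ["戦争"],
    ["近衛", "近衛"],
    [],
    ["戦争", "近衛", "日", "近衛", "日", "中", "近衛"],
    ["内閣"],
]

# Score of a sentence: sum of the weights of its distinct search terms
scores = [round(sum(alpha[w] for w in set(words)), 2) for words in sent]

# Sentence indices in descending order of score
ranking = sorted(range(len(sent)), key=lambda j: scores[j], reverse=True)
```

The resulting order of sentence indices is 0, 5, 1, 3, 6, 2, 4, which matches the sorted list above.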

[Sentence reduction processing]
Assuming that the snippet size is set to 172 characters, sent₀ (56 characters) and sent₅ (131 characters) are selected, but the total is 187 characters, so (187 − 172) = 15 characters must be deleted.
The process of deleting 15 characters from sent₀ and sent₅ according to the sentence reduction method above is as follows.
sent₀ = 第１次近衛内閣は、元老・西園寺公望の奏薦を受けて、公爵・貴族院議長の近衛文麿に大命降下され、組閣した内閣である。
sent₅ = 同年１１月には「東亜新秩序建設」を戦争目的と規定する声明（東亜新秩序声明、第二次近衛声明）を発表し、同年１２月には親日派の汪兆銘の重慶脱出を受けて「近衛三原則」（善隣友好、共同防共、経済提携）を日中和平の基本方針として呼びかける声明（第三次近衛声明）を発表した。
L₀ = 56, L₅ = 131
score₀ / L₀ = 76.82 / 56 = 1.37
score₅ / L₅ = 74.29 / 131 = 0.57
Therefore one character is deleted from sent₅. (Even after 15 characters are deleted from sent₅, 74.29 / 116 = 0.64, which is still below 1.37, so only sent₅ needs to be reduced.)
The search terms found in sent₅, together with the sentence boundaries, are
W_BOS {"", 236, 236}
W₀₅ {"戦争", 253, 255}
W₁₅ {"近衛", 277, 279}
W₂₅ {"日", 294, 295}
W₃₅ {"近衛", 310, 312}
W₄₅ {"日", 333, 334}
W₅₅ {"中", 334, 335}
W₆₅ {"近衛", 356, 358}
W_EOS {"。", 366, 367}
The distances (numbers of characters) between the search terms contained in sent₅ are
d_BOS5,05 = 17
d_05,15 = 21
d_15,25 = 16
d_25,35 = 15
d_35,45 = 21
d_45,55 = 0
d_55,65 = 21
d_65,75 = 9

In the following calculation, search terms with a weight of 1 are assumed, for convenience, to exist at the beginning and end of the sentence. That is, α_BOS5 and α_EOS5 are set to 1.0.
(d_BOS5,05 × α_BOS5) / (α_BOS5 + α₀₅) = (17 × 1.0) / (1.0 + 10.95) = 1.42
(d_05,15 × α₀₅) / (α₀₅ + α₁₅) = (21 × 10.95) / (10.95 + 49.92) = 3.78
(d_15,25 × α₁₅) / (α₁₅ + α₂₅) = (16 × 49.92) / (49.92 + 7.22) = 13.98
(d_25,35 × α₂₅) / (α₂₅ + α₃₅) = (15 × 7.22) / (7.22 + 49.92) = 1.90
(d_35,45 × α₃₅) / (α₃₅ + α₄₅) = (21 × 49.92) / (49.92 + 7.22) = 18.35
(d_45,55 × α₄₅) / (α₄₅ + α₅₅) = 0
(d_55,65 × α₅₅) / (α₅₅ + α₆₅) = (21 × 6.2) / (6.2 + 49.92) = 2.32
(d_65,75 × α₆₅) / (α₆₅ + α_EOS5) = (9 × 49.92) / (49.92 + 1.0) = 8.82
The largest value is 18.35, so the 19th character (18.35 rounded up) counted from the string W₃₅ {"近衛"} in sent₅ is deleted. Recalculating the weight per character of sent₅ gives
score₅ / L₅ = 74.29 / 130 = 0.57
so another character is deleted from sent₅.
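The eight gap values and the choice of the first character to delete can be recomputed as follows. The two lists restate the weights (with 1.0 for the sentence boundaries) and the distances given above; the variable names are illustrative.

```python
import math

# Weights in sentence order: BOS, W05, W15, W25, W35, W45, W55, W65, EOS
alpha_seq = [1.0, 10.95, 49.92, 7.22, 49.92, 7.22, 6.20, 49.92, 1.0]
# Characters between each pair of neighbouring anchors, as listed in the text
d = [17, 21, 16, 15, 21, 0, 21, 9]

raw = [d[i] * alpha_seq[i] / (alpha_seq[i] + alpha_seq[i + 1]) for i in range(len(d))]
values = [round(v, 2) for v in raw]

best = max(range(len(raw)), key=lambda i: raw[i])  # gap with the largest value
position = math.ceil(raw[best])                    # character to delete, counted from the word
```

Index 4 is the gap that follows W₃₅, and rounding 18.35 up gives the 19th character, as stated above.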

Only d_35,45 differs from the previous round, so
(d_35,45 × α₃₅) / (α₃₅ + α₄₅) = (20 × 49.92) / (49.92 + 7.22) = 17.47
and the 18th character from the string W₃₅ {"近衛"} is deleted (2 characters deleted in total).
Continuing in the same way, once the 15th character from the string W₃₅ {"近衛"} has been deleted (5 characters deleted in total),
(d_35,45 × α₃₅) / (α₃₅ + α₄₅) = (16 × 49.92) / (49.92 + 7.22) = 13.98
Here (d_35,45 × α₃₅) / (α₃₅ + α₄₅) has the same value as (d_15,25 × α₁₅) / (α₁₅ + α₂₅), so this time the 14th character from the string W₁₅ {"近衛"} is deleted. (Because the values are equal, the neighborhood of either string may be chosen; for convenience, this example deletes from the neighborhood of the string that appears first when the values are equal.)
(d_15,25 × α₁₅) / (α₁₅ + α₂₅) = (15 × 49.92) / (49.92 + 7.22) = 13.1
This is now smaller than 13.98, so the 14th character from the string W₃₅ {"近衛"} is deleted next.
The above processing is repeated.

Once the 10th character from the string W₁₅ {"近衛"} has been deleted,
(d_15,25 × α₁₅) / (α₁₅ + α₂₅) = (10 × 49.92) / (49.92 + 7.22) = 8.74
(d_35,45 × α₃₅) / (α₃₅ + α₄₅) = (11 × 49.92) / (49.92 + 7.22) = 9.61
so the 10th character from the string W₃₅ {"近衛"} is deleted.
With this, 15 characters have been deleted from sent₅ and the reduction processing is complete.

With the above processing the specified number of characters is reached, and the generated snippet is as follows. The portions removed by the reduction appear as "．．．".
[Processing result]
第１次近衛内閣は、元老・西園寺公望の奏薦を受けて、公爵・貴族院議長の近衛文麿に大命降下され、組閣した内閣である
同年１１月には「東亜新秩序建設」を戦争目的と規定する声明（東亜新秩序声明、第二次近衛声明）を発表し、．．．日派の汪兆銘の重慶脱出を受けて「近衛三原則」（善隣友．．．済提携）を日中和平の基本方針として呼びかける声明（第三次近衛声明）を発表した。
(English translation, for reference: The First Konoe Cabinet was formed by Fumimaro Konoe, a prince and President of the House of Peers, who received the imperial mandate on the recommendation of the elder statesman Kinmochi Saionji. In November of the same year it issued a statement defining "the construction of a New Order in East Asia" as the purpose of the war (New Order in East Asia Statement, Second Konoe Statement), and ... following the escape from Chongqing of Wang Zhaoming of the ...-Japanese faction, it issued a statement (Third Konoe Statement) calling for the "Three Konoe Principles" (good-neighborly ... cooperation) as the basic policy for Sino-Japanese peace.)

11 snippet generation device, 12 input/output device, 13 snippet generation block, 14 query analysis unit, 15 snippet generation unit, 16 important sentence selection unit, 17 sentence reduction unit, 18 search device, 19 document database, 20 search result acquisition unit

Japanese Translation of PCT International Application Publication No. 2008-535095
US Patent Application Publication No. 2009/0292683
Japanese Patent No. 4282381
Japanese Patent No. 4088176

http://labs.cybozu.co.jp/blog/kazuho/archives/2006/04/summarize.php

Claims

A snippet generation method for, when a document related to a sentence is input, extracting a portion related to the sentence from the document and generating a summary with a specified number of characters, the method comprising:
a sentence analysis step of performing morphological analysis on the given sentence to decompose it into morphemes, and converting a pattern that takes the inflected forms of each morpheme into account into a regular expression;
an important sentence selection step of performing a pattern search on the target document, measuring the frequency of occurrence of the regular expression, calculating the score of each sentence constituting the document based on the in-document frequency and the document frequency of the regular expression, and selecting the top n sentences in descending order of score; and
a sentence reduction step of deleting unnecessary portions from the n sentences so that they fit the specified number of characters, thereby producing a snippet.

The important sentence selection step includes:
Calculating a search term weight based on the regular expression occurrence frequency and the document occurrence frequency;
Calculating weights of sentences constituting the document using weights of search terms;
The snippet generation method according to claim 1, further comprising: selecting a sentence from the target document in descending order of weight until the specified number of characters is exceeded.

The snippet generation method according to claim 1, wherein the sentence reduction step includes a step of adjusting the snippet length to the specified number of characters by repeating a process of deleting one character at a time from the least important part.

The snippet generation method according to claim 1, wherein only a language dictionary is used as the dictionary in the morphological analysis of the sentence analysis step.

When a document related to a sentence is given, an apparatus that extracts a part related to the sentence from the document and generates a summary sentence of a specified number of characters,
Sentence analysis means for analyzing a given sentence by morphological analysis and decomposing it into a morpheme, and converting a pattern taking into account the word form of each morpheme into a regular expression;
Perform a pattern search on the target document, measure the frequency of regular expressions, calculate the score of the sentences that make up the document based on the frequency of regular expressions in the document and the frequency of document occurrence, An important sentence selection means for selecting the top n sentences;
A snippet generation apparatus comprising: a sentence contraction unit that deletes unnecessary parts from n sentences so as to have a specified number of characters, thereby forming a snippet.

When a document related to a sentence is input, a computer program for extracting a part related to the sentence from the document and generating a summary sentence with a specified number of characters,
A sentence analysis step of analyzing a given sentence by morphological analysis and decomposing it into morphemes, and converting a pattern that takes into account the word form of each morpheme into a regular expression;
Perform a pattern search on the target document, measure the frequency of regular expressions, calculate the score of the sentences that make up the document based on the frequency of regular expressions in the document and the frequency of document occurrence, An important sentence selection step of selecting the top n sentences;
A snippet generation program that causes a computer to execute a sentence contraction step in which unnecessary parts are deleted from n sentences so as to have a specified number of characters, thereby forming a snippet.

A recording medium on which is recorded a computer program for, when a document related to a sentence is input, extracting a portion related to the sentence from the document and generating a summary with a specified number of characters, the recorded snippet generation program causing a computer to execute:
a sentence analysis step of performing morphological analysis on the given sentence to decompose it into morphemes, and converting a pattern that takes the inflected forms of each morpheme into account into a regular expression;
an important sentence selection step of performing a pattern search on the target document, measuring the frequency of occurrence of the regular expression, calculating the score of each sentence constituting the document based on the in-document frequency and the document frequency of the regular expression, and selecting the top n sentences in descending order of score; and
a sentence reduction step of deleting unnecessary portions from the n sentences so that they fit the specified number of characters, thereby producing a snippet.