JP2010122823A - Text processing system, information processing apparatus, method for processing text and information, and processing program

Info

Publication number: JP2010122823A
Application number: JP2008294778A
Authority: JP
Inventors: Katsushi Matsuda; 勝志松田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-11-18
Filing date: 2008-11-18
Publication date: 2010-06-03

Abstract

PROBLEM TO BE SOLVED: To provide a text processing system, an information processing apparatus, text and information processing methods, and processing programs that can identify the portion of a digitized text document from which important sentences should be extracted.

SOLUTION: The text processing system 10 includes: action term extraction means 11 for extracting, for each segment, from segment data obtained by dividing text data into segments, text information that matches any of a set of action terms, which are terms selected by a predetermined rule that directly express the intention of a sentence; action term comparison means 12 for comparing the per-segment action terms extracted from each segment with integrated action terms obtained by integrating all segments; and segment discrimination means 13 for determining, from the comparison result, that the segment whose per-segment action terms are most similar to the integrated action terms is the segment forming the main part of the text data of the document.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a text processing system that extracts, from a document such as a digitized text document or text passage, a summary or a partial text representing the point the document asserts; to an information processing apparatus that extracts such partial text from the document; to a text processing method and an information processing method for extracting such partial text; and to a text processing program and an information processing program for extracting such partial text.

The languages that people use in everyday life are called natural languages. Compared with artificially defined formal languages such as programming languages, natural languages are rich in variety. Many people create large volumes of text passages and text documents written in natural language every day using information processing devices such as computers. There is therefore a demand for text processing that finds, within this large volume of natural-language text passages and text documents (hereinafter simply “documents”), the documents each person needs.

To meet this demand, text processing systems that generate, from a document, a summary or a passage stating the document's main assertion have attracted attention. Automatically generating a summary from a document with a text processing system is called automatic text summarization. Strictly speaking, a passage that states a document's main assertion is not the same thing as a text summary; in this specification, however, it is also treated as a summary in the broad sense.

The mainstream technique for automatic text summarization extracts from the document the important sentences that will make up the summary, concatenates them, and then edits the result so that it reads naturally. The mainstream way to extract the important sentences on which the summary is based uses morphological analysis. Morphological analysis is the process of decomposing a sentence into morphemes, the smallest meaningful units that make up the sentence.

To extract important sentences, each sentence in the document is first morphologically analyzed, and terms whose part of speech is a particular kind of noun are extracted. In this specification, “term” is a general name for words, compound words, phrases, and the like. Once terms have been extracted from the document, important terms are identified from their frequency of occurrence, and sentences containing those important terms are taken as important sentences. This technique of identifying important terms by frequency of occurrence and extracting important sentences has been proposed as a first related technique (see, for example, Patent Document 1).
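The frequency-based extraction described for the first related technique can be illustrated with a minimal sketch. It is not taken from Patent Document 1: the function name `important_sentences`, the parameter `top_terms`, and the use of a regular expression over English words in place of morphological analysis and noun filtering are all assumptions made for illustration.

```python
import re
from collections import Counter


def important_sentences(text, top_terms=3):
    """Return the sentences that contain any of the most frequent terms."""
    # Naive sentence split on terminal punctuation (an assumption;
    # a real system would use a proper sentence splitter).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def terms(sentence):
        # Stand-in for morphological analysis plus noun filtering:
        # lowercase alphabetic tokens longer than three characters.
        return [w for w in re.findall(r"[a-z]+", sentence.lower()) if len(w) > 3]

    # Count how often each term occurs across the whole document.
    freq = Counter(t for s in sentences for t in terms(s))
    key_terms = {t for t, _ in freq.most_common(top_terms)}

    # A sentence is "important" if it contains at least one key term.
    return [s for s in sentences if key_terms & set(terms(s))]
```

Because sentences are selected one by one, the output can be a set of disconnected fragments.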

Unlike the first related technique, which simply uses frequency of occurrence, a second related technique extracts important sentences on the basis of word occurrence density (see, for example, Patent Document 2). In the second related technique, the document is first morphologically analyzed, and a set of words needed as clues for summarization is extracted from the document according to the type of summary. The document is then divided into several semantic units, and the important portions, where the words in the set occur with high density, are computed. Sentences are extracted from these important portions according to a given summarization ratio.

Furthermore, a third related technique prepares summary scenarios according to the type of document and treats sentences in the document that contain predicates matching the scenario as important sentences (see, for example, Patent Document 3). For a text such as a paper from which important sentences such as an abstract are to be produced, this technique projects a representative scenario of the relevant field onto the text and, through that projection, extracts only the plot, that is, the outline or conception the author originally intended to convey, thereby creating the important sentences automatically.

Relatively long documents often come with a summary or an equivalent abstract. A fourth related technique therefore proposes making a section whose subheading relates to a summary, such as “Overview”, part of the summary (see, for example, Patent Document 4). In the fourth related technique, parts that do not need to be summarized, such as a “Synopsis” section, are carried over into the summary result unchanged from the original document.

Furthermore, a fifth related technique uses a dictionary for identifying important paragraphs and treats those paragraphs as candidates for part of the summary (see, for example, Patent Document 5). In the fifth related technique, important paragraphs are identified by pattern matching the text data against a headword dictionary, in which the headwords used to extract important paragraphs are registered. A provisional summary is then identified within the important paragraphs, and unnecessary portions are deleted to create the summary automatically.

Patent Document 1: JP 2002-297635 A (paragraph 009, FIG. 2)
Patent Document 2: JP 2002-259371 A (paragraph 008, FIG. 1)
Patent Document 3: JP 07-013967 A (paragraph 0010, FIG. 1)
Patent Document 4: JP 06-012447 A (paragraphs 0021 and 0101, FIG. 12)
Patent Document 5: JP 06-259423 A (paragraph 0007, FIG. 1)

However, techniques such as the first to third related techniques, which create important sentences by extracting sentences from the document using terms as clues, end up extracting fragmentary sentences from the document. The result is therefore of insufficient quality compared with important sentences, such as summaries, written by hand.

Techniques such as the fourth and fifth related techniques, which identify sections or paragraphs in the document that amount to a summary, require dedicated dictionaries or rules to be prepared in advance. Alternatively, the sections or paragraphs that make up the document must carry subheadings that identify where important sentences are to be taken from. This limits the documents that can be handled when creating important sentences.

Accordingly, an object of the present invention is to provide a text processing system, an information processing apparatus, text and information processing methods, and processing programs capable of identifying the portion of any of various digitized text documents from which important sentences are to be extracted.

In the present invention, a text processing system comprises: (a) action term extraction means for extracting, for each segment, from segment data obtained by dividing the text data constituting one document into segments, each segment being a coherent range of text, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; (b) action term comparison means for comparing the per-segment action terms, that is, the action terms extracted for each segment by the action term extraction means, with the integrated action terms, that is, the action terms obtained by integrating all segments of the text data constituting the document; and (c) segment discrimination means for determining, from the comparison result of the action term comparison means, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

Also in the present invention, an information processing apparatus comprises: (a) segment dividing means for dividing the text data constituting one document into segments, each segment being a coherent range of text; and (b) the text processing system according to any one of claims 1 to 15.

Further, in the present invention, a text processing method comprises: (a) an action term extraction step of extracting, for each segment, from segment data obtained by dividing the text data constituting one document into segments, each segment being a coherent range of text, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; (b) an action term comparison step of comparing the per-segment action terms extracted for each segment in the action term extraction step with the integrated action terms obtained by integrating all segments of the text data constituting the document; and (c) a segment discrimination step of determining, from the comparison result of the action term comparison step, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

Furthermore, in the present invention, an information processing method comprises: (a) a segment division step of dividing the text data constituting one document into segments, each segment being a coherent range of text; (b) an action term extraction step of extracting, for each segment, from the segment data produced by the segment division step, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; (c) an action term comparison step of comparing the per-segment action terms extracted for each segment in the action term extraction step with the integrated action terms obtained by integrating all segments of the text data constituting the document; and (d) a segment discrimination step of determining, from the comparison result of the action term comparison step, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

Also in the present invention, a text processing program causes a computer to execute: (a) an action term extraction process of extracting, for each segment, from segment data obtained by dividing the text data constituting one document into segments, each segment being a coherent range of text, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; (b) an action term comparison process of comparing the per-segment action terms extracted for each segment in the action term extraction process with the integrated action terms obtained by integrating all segments of the text data constituting the document; and (c) a segment discrimination process of determining, from the comparison result of the action term comparison process, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

Further, in the present invention, an information processing program causes a computer to execute: (a) a segment division process of dividing the text data constituting one document into segments, each segment being a coherent range of text; (b) an action term extraction process of extracting, for each segment, from the segment data produced by the segment division process, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; (c) an action term comparison process of comparing the per-segment action terms extracted for each segment in the action term extraction process with the integrated action terms obtained by integrating all segments of the text data constituting the document; and (d) a segment discrimination process of determining, from the comparison result of the action term comparison process, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

As described above, according to the present invention, a partial text that can serve as the main part of a document can be extracted from the digitized text data of a wide variety of documents while keeping its quality as prose high.

FIG. 1 is a claim correspondence diagram of the text processing system of the present invention. The text processing system 10 of the present invention comprises: action term extraction means 11 for extracting, for each segment, from segment data obtained by dividing the text data constituting one document into segments, each segment being a coherent range of text, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; action term comparison means 12 for comparing the per-segment action terms extracted for each segment by the action term extraction means 11 with the integrated action terms obtained by integrating all segments of the text data constituting the document; and segment discrimination means 13 for determining, from the comparison result of the action term comparison means 12, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

FIG. 2 is a claim correspondence diagram of the information processing apparatus of the present invention. The information processing apparatus 20 of the present invention comprises segment dividing means 21 for dividing the text data constituting one document into segments, each segment being a coherent range of text, and the text processing system 22 according to any one of claims 1 to 15.

FIG. 3 is a claim correspondence diagram of the text processing method of the present invention. The text processing method 30 of the present invention comprises: an action term extraction step 31 of extracting, for each segment, from segment data obtained by dividing the text data constituting one document into segments, each segment being a coherent range of text, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; an action term comparison step 32 of comparing the per-segment action terms extracted for each segment in the action term extraction step 31 with the integrated action terms obtained by integrating all segments of the text data constituting the document; and a segment discrimination step 33 of determining, from the comparison result of the action term comparison step 32, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

FIG. 4 is a claim correspondence diagram of the information processing method of the present invention. The information processing method 40 of the present invention comprises: a segment division step 41 of dividing the text data constituting one document into segments, each segment being a coherent range of text; an action term extraction step 42 of extracting, for each segment, from the segment data produced by the segment division step 41, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; an action term comparison step 43 of comparing the per-segment action terms extracted for each segment in the action term extraction step 42 with the integrated action terms obtained by integrating all segments of the text data constituting the document; and a segment discrimination step 44 of determining, from the comparison result of the action term comparison step 43, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

FIG. 5 is a claim correspondence diagram of the text processing program of the present invention. The text processing program 50 of the present invention causes a computer to execute: an action term extraction process 51 of extracting, for each segment, from segment data obtained by dividing the text data constituting one document into segments, each segment being a coherent range of text, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; an action term comparison process 52 of comparing the per-segment action terms extracted for each segment in the action term extraction process 51 with the integrated action terms obtained by integrating all segments of the text data constituting the document; and a segment discrimination process 53 of determining, from the comparison result of the action term comparison process 52, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

FIG. 6 is a claim correspondence diagram of the information processing program of the present invention. The information processing program 60 of the present invention causes a computer to execute: a segment division process 61 of dividing the text data constituting one document into segments, each segment being a coherent range of text; an action term extraction process 62 of extracting, for each segment, from the segment data produced by the segment division process 61, text information that matches any of a set of action terms, the action terms being selected by a predetermined rule as terms that concisely express the intention of a sentence; an action term comparison process 63 of comparing the per-segment action terms extracted for each segment in the action term extraction process 62 with the integrated action terms obtained by integrating all segments of the text data constituting the document; and a segment discrimination process 64 of determining, from the comparison result of the action term comparison process 63, that the segment from which the per-segment action terms most similar to the integrated action terms were extracted is the segment forming the main part of the text data of the document.

<First Embodiment of the Invention>

Next, a first embodiment of the present invention will be described.

FIG. 7 shows the configuration of an information processing apparatus that uses the text processing system according to the first embodiment of the present invention. The information processing apparatus 100 has a control unit 103 that includes a CPU (Central Processing Unit) 101 and a memory 102, at least part of which stores a control program. The control unit 103 controls the units, described next, that make up the text processing system 110.

The document collection unit 105 stores a predetermined number of documents as electronic data. The segmentation unit 106 divides document data 107, the text data of a desired document read from the document collection unit 105, into segments, each a coherent range of text such as a section or paragraph, and inputs the resulting text data to the text processing system 110 as segment data 108. The text processing system 110 supplies the core segment 111, the optimal segment, to the output unit 112. Here, the core segment 111 is the segment that has, for example, the highest similarity to the entire document being processed. The output unit 112 outputs the core segment 111 to the outside.

The text processing system 110 of this embodiment includes an action term extraction unit 121 that extracts action terms from the input segment data 108. In this specification, an “action term” denotes the concise meaning expressed by each piece of segment data 108 produced by the division into segments. The action terms 122 extracted by the action term extraction unit 121 are supplied to the action vector generation unit 123. The action vector generation unit 123 generates a vector of the action terms extracted for each segment, and an integrated vector 124 obtained by summing the vectors of all segments.
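The roles of the action term extraction unit 121 and the action vector generation unit 123 can be illustrated with a minimal sketch. The specification does not give the action term list or its matching rule at this point, so `ACTION_TERMS`, the English example terms, the plain substring matching, and the function names `action_vector` and `integrated_vector` are all hypothetical.

```python
from collections import Counter

# Hypothetical action-term list (an assumption for illustration only).
ACTION_TERMS = ("propose", "evaluate", "compare", "conclude")


def action_vector(segment_text, action_terms=ACTION_TERMS):
    """Build one segment's action vector as term -> occurrence count."""
    text = segment_text.lower()
    # Simple substring matching (an assumption); "proposed" counts as "propose".
    counts = Counter({term: text.count(term) for term in action_terms})
    # Unary plus drops terms whose count is zero.
    return +counts


def integrated_vector(segment_vectors):
    """Sum the action vectors of all segments into one integrated vector."""
    total = Counter()
    for vector in segment_vectors:
        total.update(vector)  # adds counts term by term
    return total
```

A sparse `Counter` is used here so that terms absent from a segment take no space; a fixed-length list indexed by term would serve equally well.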

The integrated vector 124 generated by the action vector generation unit 123 and the action vector 125 of each segment are supplied to and stored in the action vector storage unit 126. The core segment determination unit 128 reads the action vector of each segment and the integrated vector stored in the action vector storage unit 126 as comparison data 129, compares them, identifies the optimal segment, and outputs it as the core segment 111.
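The comparison performed by the core segment determination unit 128 can be sketched as follows. The passage above says only that each segment's action vector is compared with the integrated vector; cosine similarity is used here as one plausible measure and is an assumption, as are the function names `cosine` and `core_segment_index` and the representation of vectors as term-to-count dictionaries.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of term -> count)."""
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    # An empty vector has no direction; treat its similarity as zero.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def core_segment_index(segment_vectors, integrated):
    """Return the index of the segment most similar to the integrated vector."""
    scores = [cosine(vector, integrated) for vector in segment_vectors]
    # max() returns the first index on ties, so earlier segments win.
    return max(range(len(scores)), key=scores.__getitem__)
```

Under this measure, the segment whose mix of action terms best matches the document as a whole is selected, regardless of how long the segment is.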

At least some of the components of the text processing system 110, such as the action term extraction unit 121, can be realized in software by the CPU 101 executing the control program stored in the memory 102. In this embodiment the text processing system 110 is configured as part of the information processing apparatus 100, but other configurations are possible. For example, the document collection unit 105, the segmentation unit 106, and the output unit 112 may reside on another information processing apparatus (not shown) reached through a communication network (not shown). Furthermore, the document collection unit 105 need not be configured as a single database and may be distributed across multiple databases.

次に、本実施の形態の情報処理装置１００の動作について、テキスト処理システム１１０を中心に説明する。 Next, the operation of the information processing apparatus 100 according to the present embodiment will be described focusing on the text processing system 110.

文書集合部１０５には、人が作成した電子化された文書またはそれらのコピーが蓄積されている。処理の対象となる電子化された文書は、それぞれ１つずつの話題について記述されたものであることが望ましい。これは文書集合部１０５に格納する文書の種類を強く制限する制約ではない。１つの文書が、全く異なる話題について書かれた複数の文書を結合したものは、テキスト処理システム１１０の処理の対象とする文書として好ましくない、という程度の制約である。 The document collection unit 105 stores digitized documents created by people, or copies of them. It is desirable that each digitized document to be processed describes a single topic. This is not a strong restriction on the types of documents stored in the document collection unit 105. It merely means that a document formed by combining a plurality of documents written on completely different topics is not preferable as a document to be processed by the text processing system 110.

また、文書集合部１０５に蓄積されている文書は、どのような文書作成アプリケーションソフトウェアで作成され、また、どのようなアプリケーション形式で電子化されているかは問題とされない。文書集合部１０５から文書のテキストを文書データ１０７として取り出す際には、該当するアプリケーション形式からテキストを抜き出す既存の技術を用いることができるからである。 Further, it does not matter what kind of document creation application software the document stored in the document collection unit 105 is created and what kind of application format it is digitized. This is because when extracting the text of the document from the document collection unit 105 as the document data 107, an existing technique for extracting the text from the corresponding application format can be used.

本実施の形態では、処理の対象となる文書が学術論文である場合を例に挙げて説明する。学術論文はテキスト処理システム１１０の処理できる文書の一例であることは当然である。 In this embodiment, a case where a document to be processed is an academic paper will be described as an example. Naturally, an academic paper is an example of a document that can be processed by the text processing system 110.

セグメント化部１０６は、文書集合部１０５に蓄積されている文書またはこれらの文書から抜き出されたテキストデータを文書データ１０７として入力し、セグメントに分割する。ここでセグメントとは、パラグラフや節等の「部分」あるいは「断片」をいう。特殊な例としては、プレゼンテーションソフトウェアの文書におけるスライドもセグメントの１つとなる。セグメントは、いずれの単位のものでも構わないが、対象文書の種類に応じて、「単位」を変化させることも可能である。本実施の形態では、一例としてセグメント化部１０６が文書データ１０７を「節」に分割するものとして説明を行う。 The segmentation unit 106 inputs the document stored in the document collection unit 105 or text data extracted from these documents as document data 107 and divides it into segments. Here, the segment means a “part” or “fragment” such as a paragraph or a clause. As a special example, a slide in a presentation software document is also a segment. The segment may be in any unit, but the “unit” can be changed according to the type of the target document. In the present embodiment, as an example, the description will be made assuming that the segmentation unit 106 divides the document data 107 into “sections”.

図８は、セグメント化部で分割する前の文書とセグメント化した文書を表わしたものである。ここで同図（Ａ）は図７に示した文書集合部１０５に蓄積されている所定の文書（元の文書）１３１を示している。同図（Ｂ）は、文書１３１のテキストデータとしての文書データ１０７を表わしている。この文書データ１０７は、この例で、破線で囲んだ第１〜第５のセグメント１３２₁〜１３２₅に分割される。このようにして分割された第１〜第５のセグメント１３２₁〜１３２₅は、図７に示したセグメントデータ１０８として、図７に示したテキスト処理システム１１０に入力される。 FIG. 8 shows a document before division by the segmentation unit and the segmented document. FIG. 8(A) shows a predetermined document (original document) 131 stored in the document collection unit 105 shown in FIG. 7. FIG. 8(B) shows the document data 107 as the text data of the document 131. In this example, the document data 107 is divided into first to fifth segments 132₁ to 132₅, each surrounded by a broken line. The first to fifth segments 132₁ to 132₅ divided in this way are input, as the segment data 108 shown in FIG. 7, to the text processing system 110 shown in FIG. 7.

テキスト処理システム１１０では、図７に示した文書集合部１０５に含まれるすべての文書を順に処理してもよいし、必要な文書のみを順に処理してもよい。また、本テキスト処理システムの利用者が指定する文書のみを順に処理してもよい。 In the text processing system 110, all the documents included in the document collection unit 105 shown in FIG. 7 may be processed in order, or only necessary documents may be processed in order. Further, only the document designated by the user of the text processing system may be processed in order.

テキスト処理システム１１０の初段に配置された行為用語抽出部１２１は、文書の各セグメントを表わすセグメントデータ１０８から、これらのセグメントがそれぞれ表わす端的な意味を持った用語を抽出する。ここで、端的な意味を表わす用語とは、対象となる文書の言語が日本語の場合、文の最後尾に位置する特定の動詞または特定の名詞である。 The action term extraction unit 121 arranged in the first stage of the text processing system 110 extracts terms having simple meanings respectively represented by these segments from the segment data 108 representing each segment of the document. Here, the term representing the simple meaning is a specific verb or a specific noun located at the end of the sentence when the language of the target document is Japanese.

特定の動詞の例としては、「ある」や「する」等の特定の例外の自立動詞を除いた自立動詞がある。また、特定の名詞の例としては、一般名詞およびサ変接続名詞および形容動詞の語幹になる名詞（形容動詞語幹名詞）がある。対象となる文書の言語が日本語の場合、文の最後尾に位置する特定の動詞または特定の名詞を、行為用語と呼ぶ。図８に示した行為用語は単なる一例である。行為用語は、判定する核セグメントの種類に応じて変更してもよい。ここで、「核セグメント」は、類似度または出現頻度が予め設定した閾値以上であるセグメントをいう。本実施の形態では、説明の煩雑化を防ぐため、一つの行為用語を抽出する場合について説明するが、複数個の行為用語を抽出してもよい。 Examples of the specific verbs are independent verbs other than certain excepted independent verbs such as “aru” (ある, to be) and “suru” (する, to do). Examples of the specific nouns are general nouns, sahen-connection nouns (nouns that form a verb with “suru”), and nouns that serve as the stem of an adjectival verb (adjectival-verb-stem nouns). When the language of the target document is Japanese, a specific verb or specific noun located at the end of a sentence is called an action term. The action terms shown in FIG. 8 are merely examples. The action terms may be changed according to the type of core segment to be determined. Here, a “core segment” is a segment whose similarity or appearance frequency is equal to or greater than a preset threshold. In this embodiment, to keep the explanation simple, the case of extracting one action term is described, but a plurality of action terms may be extracted.

行為ベクトル生成部１２３は、行為用語抽出部１２１によってセグメントごとに抽出された行為用語１２２を用いて各セグメントの行為用語のベクトルを生成する。また、すべてのセグメントのベクトルを生成したら、これらのベクトルの総和のベクトルである統合ベクトルを生成する。統合ベクトルを生成する際には、各セグメントのベクトルの大きさを１に正規化した上で総和を求めてもよい。 The action vector generation unit 123 generates an action term vector for each segment using the action term 122 extracted for each segment by the action term extraction unit 121. Further, when vectors of all segments are generated, an integrated vector that is a vector of the sum of these vectors is generated. When the integrated vector is generated, the sum may be obtained after normalizing the vector size of each segment to 1.
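The optional normalization mentioned above can be illustrated with a minimal Python sketch. The function names, the dictionary representation of a vector (term to weight), and the English sample terms are assumptions made for illustration; the specification does not prescribe an implementation.

```python
import math
from collections import Counter

def normalize(vec):
    """Scale a sparse vector (term -> weight) to Euclidean length 1."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in vec.items()}

def integrated_vector(segment_vectors, normalize_first=False):
    """Sum the per-segment vectors, optionally normalizing each one first."""
    total = Counter()
    for vec in segment_vectors:
        src = normalize(vec) if normalize_first else vec
        for term, weight in src.items():
            total[term] += weight
    return dict(total)

# Hypothetical action vectors for two segments.
segments = [{"propose": 3, "increase": 1}, {"propose": 1}]
raw = integrated_vector(segments)                         # plain sum of frequencies
unit = integrated_vector(segments, normalize_first=True)  # sum of unit-length vectors
```

With normalization, a long segment that contains many sentences no longer dominates the integrated vector merely because of its length.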

行為ベクトル記憶部１２６は、行為ベクトル生成部１２３によって生成された統合ベクトル１２４と各セグメントの行為ベクトル１２５を記憶する。 The action vector storage unit 126 stores the integrated vector 124 generated by the action vector generation unit 123 and the action vector 125 of each segment.

核セグメント判定部１２８は、行為ベクトル記憶部１２６に記憶されている各セグメントの行為ベクトルと統合ベクトルを順に比較データ１２９として比較して最適なセグメントを核セグメント１１１とする。ここで、最適なセグメントを核セグメント１１１として特定する手法には各種の方式を採用することができる。たとえば、統合ベクトル１２４と最もベクトルが近い行為ベクトル１２５を選択する類似度方式や、統合ベクトル１２４での頻度の高い行為用語をたくさん含む行為ベクトル１２５を選択する頻度方式がある。類似度方式では、各セグメントの行為ベクトル１２５と統合ベクトル１２４の間の類似度計算にコサイン尺度を用いればよい。頻度方式では、統合ベクトル１２４での頻度が上位の行為用語が各セグメントの行為ベクトル１２５に含まれている確率を用いればよい。 The core segment determination unit 128 compares, in order, the action vector of each segment stored in the action vector storage unit 126 with the integrated vector as the comparison data 129, and takes the optimum segment as the core segment 111. Various methods can be adopted for identifying the optimum segment as the core segment 111. For example, there is a similarity method that selects the action vector 125 closest to the integrated vector 124, and a frequency method that selects the action vector 125 containing many of the action terms that are frequent in the integrated vector 124. In the similarity method, the cosine measure may be used to calculate the similarity between the action vector 125 of each segment and the integrated vector 124. In the frequency method, the probability that the action terms with the highest frequencies in the integrated vector 124 are contained in the action vector 125 of each segment may be used.
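The frequency method is described only in outline, so the following Python sketch is one possible reading rather than the method of the specification: it takes the `top_k` most frequent action terms of the integrated vector and scores each segment by the fraction of those terms it contains. The parameter `top_k`, the function name, and the sample vectors are all assumptions.

```python
def frequency_score(segment_vec, integrated_vec, top_k=3):
    """Fraction of the top_k most frequent integrated-vector terms found in the segment."""
    # Sort by descending frequency; ties are broken alphabetically for determinism.
    top = sorted(integrated_vec, key=lambda t: (-integrated_vec[t], t))[:top_k]
    if not top:
        return 0.0
    return sum(1 for t in top if t in segment_vec) / len(top)

# Hypothetical integrated vector and two segment action vectors.
integrated = {"propose": 4, "increase": 3, "evaluate": 2, "show": 1}
seg_a = {"propose": 2, "show": 1}
seg_b = {"propose": 2, "increase": 3, "evaluate": 1}

scores = [frequency_score(s, integrated) for s in (seg_a, seg_b)]
best = max(range(len(scores)), key=scores.__getitem__)  # index of the selected segment
```

Here the second segment contains all three of the most frequent terms and would be selected.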

以上のようにしてテキスト処理システム１１０が文書集合部１０５内の該当する文書すべてについての処理を完了したら、出力部１１２がこれらの文書について供給された核セグメント１１１を結果として出力することになる。テキスト処理システム１１０が文書集合部１０５内のすべての文書を指定して処理し、出力部１１２がこれら文書全体の結果を出力してもよい。 When the text processing system 110 completes the processing for all the corresponding documents in the document collection unit 105 as described above, the output unit 112 outputs the core segments 111 supplied for these documents as a result. The text processing system 110 may designate and process all the documents in the document collection unit 105, and the output unit 112 may output the results of these documents as a whole.

このような本実施の形態のテキスト処理システム１１０によれば、さまざまな文書から、これらの文書の要約となりうる文章として品質の高い部分テキストを抽出することができるという効果がある。文章としての品質が高い理由は、文書中のセグメントを選択するためである。また、本実施の形態で使用するセグメントは文の間のつながりが自然であり、また、それぞれのセグメントはそれぞれ単一のトピックについて記述されていることが多く、文章としてまとまっている傾向があるからである。本実施の形態のテキスト処理システム１１０がさまざまな文書に対応できる理由は、辞書やルールといった手法を必要としない手法であるためである。また、文書の要部の抽出に小見出しを使わず、文書全体で用いられる行為用語と類似した行為用語を用いているセグメントまたは文書全体で頻度の高い行為用語を多く使っているセグメントを選択するようにしているからである。 According to the text processing system 110 of this embodiment, high-quality partial texts that can serve as summaries can be extracted from a wide variety of documents. The extracted text is of high quality as prose because whole segments of the document are selected. Within a segment of the kind used in this embodiment, the connections between sentences are natural, and each segment usually describes a single topic and therefore tends to be coherent as a passage. The text processing system 110 of this embodiment can handle a wide variety of documents because the method does not rely on dictionaries or rules. In addition, instead of using subheadings to extract the main part of a document, it selects the segment whose action terms are similar to those used throughout the document, or the segment that uses many of the action terms that are frequent throughout the document.

図９は、本発明の第１の実施例におけるテキスト処理システムを使用した情報処理装置の構成を表わしたものである。この図９に示した本実施例の情報処理装置１００Ａにおけるテキスト処理システム１１０Ａで、図７と同一部分には同一の符号を付しており、これらの説明を適宜省略する。 FIG. 9 shows the configuration of an information processing apparatus using the text processing system in the first embodiment of the present invention. In the text processing system 110A in the information processing apparatus 100A of the present embodiment shown in FIG. 9, the same parts as those in FIG. 7 are denoted by the same reference numerals, and description thereof will be omitted as appropriate.

第１の実施例のテキスト処理システム１１０Ａは、行為用語抽出部１２１Ａが第１〜第Ｍの行為用語リスト１４１₁〜１４１_Mを備えている。ここで符号Ｍは、１つのセグメントとしての文章に含まれる可能性のある文の総数として予想される値の上限値あるいはこれよりも大きな整数である。また、第１の実施例のテキスト処理システム１１０Ａの行為ベクトル生成部１２３Ａは、第１〜第Ｍの行為ベクトル１４２₁〜１４２_Mと、文書全体の統合ベクトル１４３を生成するようになっている。制御部１０３Ａ内のメモリ１０２Ａには、第１の実施例における情報処理装置１００Ａの制御を行う制御プログラムが格納されている。第１〜第Ｍの行為用語リスト１４１₁〜１４１_M、第１〜第Ｍの行為ベクトル１４２₁〜１４２_Mおよび統合ベクトル１４３については、後に説明する。 In the text processing system 110A of the first example, the action term extraction unit 121A includes first to Mth action term lists 141₁ to 141_M. Here, M is the upper limit expected for the total number of sentences that may be contained in the text of one segment, or an integer larger than this. The action vector generation unit 123A of the text processing system 110A of the first example generates first to Mth action vectors 142₁ to 142_M and an integrated vector 143 for the entire document. The memory 102A in the control unit 103A stores a control program that controls the information processing apparatus 100A of the first example. The first to Mth action term lists 141₁ to 141_M, the first to Mth action vectors 142₁ to 142_M, and the integrated vector 143 are described later.

この第１の実施例で文書集合部１０５には、文書１３１の一例として図８（Ａ）に示すような学術論文が、１編だけ、同図（Ｂ）に示すテキストデータとしての文書データ１０７として蓄積されているものとする。既に説明したように、任意のセグメントに分割できる自然言語で書かれた一まとまりの電子化された文書データ１３１であれば、学術論文以外の文書もテキスト処理システム１１０の対象となる。 In the first example, it is assumed that the document collection unit 105 stores just one academic paper such as that shown in FIG. 8(A), as an example of the document 131, in the form of the document data 107, that is, the text data shown in FIG. 8(B). As already described, documents other than academic papers can also be processed by the text processing system 110, provided that they are coherent digitized document data 131 written in natural language that can be divided into arbitrary segments.

セグメント化部１０６は、図８（Ｂ）に示すように文書データ１０７を破線で囲んだ第１〜第５のセグメント１３２₁〜１３２₅に分割する。このようなセグメント化部１０６の分割処理は、節番号を基にしたり、２行以上の改行によって文書をセグメントに分割する既存の方法を用いることができる。本実施例では、第１〜第５のセグメント１３２₁〜１３２₅をそれぞれ破線で囲んで示しているが、実際にはＸＭＬ（Extensible Markup Language）等の構造化言語を用いて、それぞれのセグメントの範囲を示すことができる。 As shown in FIG. 8(B), the segmentation unit 106 divides the document data 107 into the first to fifth segments 132₁ to 132₅ surrounded by broken lines. For this division, the segmentation unit 106 can use an existing method that divides a document into segments on the basis of section numbers or at runs of two or more line breaks. In this example the first to fifth segments 132₁ to 132₅ are each shown surrounded by a broken line, but in practice the range of each segment can be indicated using a structured language such as XML (Extensible Markup Language).
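As a minimal sketch of the line-break-based division and XML marking described above, the following Python code splits text at runs of two or more line breaks and wraps each segment in an element. The element names `document` and `segment`, the `id` attribute, and the sample text are assumptions; the specification only states that a structured language such as XML can indicate segment ranges.

```python
import re
import xml.etree.ElementTree as ET

def split_into_segments(text):
    """Split text at runs of two or more line breaks and drop empty pieces."""
    parts = re.split(r"(?:\r?\n[ \t]*){2,}", text.strip())
    return [p.strip() for p in parts if p.strip()]

def segments_to_xml(segments):
    """Mark segment ranges with XML elements (element names are hypothetical)."""
    root = ET.Element("document")
    for number, body in enumerate(segments, start=1):
        ET.SubElement(root, "segment", id=str(number)).text = body
    return ET.tostring(root, encoding="unicode")

sample = "1. Intro\nText a.\n\n2. Method\nText b.\n\n\n"
segments = split_into_segments(sample)
xml_text = segments_to_xml(segments)
```

A single line break inside a section does not start a new segment, so a heading stays attached to the text that follows it.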

テキスト処理システム１１０は、セグメント化部１０６を介して文書集合部１０５から学術論文を１編ずつ取り出して以降の処理を行う。本実施例では図８（Ａ）に示す文書データ１３１についてそのセグメントデータ１０８がテキスト処理システム１１０に取り込まれる。文書集合部１０５に複数の文書１３１についての文書データ１０７が格納されている場合には、１編ずつ処理が繰り返されて、これら複数の文書を処理すればよい。 The text processing system 110 takes academic papers one at a time from the document collection unit 105 via the segmentation unit 106 and performs the subsequent processing. In this example, the segment data 108 of the document data 131 shown in FIG. 8(A) is taken into the text processing system 110. When the document collection unit 105 stores document data 107 for a plurality of documents 131, the processing is simply repeated one document at a time until all of them have been processed.

本実施例で取り扱う文書１３１は、一例を挙げると、マイクロソフトワード（登録商標）に代表されるワープロソフトによる保存形式であってもよい。また、アドビシステムズ社の開発したビューアーソフトであるアクロバットリーダ（登録商標）に適用されるＰＤＦ（Portable Document Format）という保存形式であってもよい。文書１３１は他の保存形式のものであってもよいことはもちろんである。 For example, the document 131 handled in the present embodiment may be stored in a word processor software represented by Microsoft Word (registered trademark). Also, a storage format called PDF (Portable Document Format) applied to Acrobat Reader (registered trademark), which is viewer software developed by Adobe Systems, may be used. Of course, the document 131 may be in other storage formats.

図９には示していないが、各種の保存形式で保存した文書データ１０７を文書集合部１０５から取り出す際には、各種の保存形式で格納された文書からテキスト情報を抜き出す既存の技術を用いることができる。 Although not shown in FIG. 9, when document data 107 saved in various storage formats is taken from the document collection unit 105, existing techniques for extracting text information from documents stored in those formats can be used.

図１０は、本実施例のテキスト処理システムにおける行為用語抽出部の処理の様子を表わしたものである。図１０に示す処理は、図９に示したメモリ１０２Ａに格納された制御プログラムをＣＰＵ１０１が実行することによって実現する。図８および図９と共に説明する。 FIG. 10 shows a state of processing of the action term extraction unit in the text processing system of the present embodiment. The processing shown in FIG. 10 is realized by the CPU 101 executing the control program stored in the memory 102A shown in FIG. This will be described with reference to FIGS.

まず、行為用語抽出部１２１Ａはセグメント化部１０６から１編の文書データ１０７が第１〜第５のセグメント１３２₁〜１３２₅に分割された場合におけるその文書全体を読み込む（ステップＳ２０１）。そして処理の対象となるセグメント１３２が未処理で残っているか、すなわちセグメントデータ１０８が未処理状態で残っているかどうかをチェックする（ステップＳ２０２）。行為用語抽出部１２１Ａが第１〜第５のセグメント１３２₁〜１３２₅について何らの処理も行っていない現在の状態では（Ｙ）、今回処理の対象となる第１のセグメント１３２₁を構成する文章を文に分割する（ステップＳ２０３）。文章を文という単位で分割するためには、第１のセグメント１３２₁のセグメントデータ１０８としてのテキストデータの中から読点やピリオドを探し出し、その個所で文を分割する従来の方法を使えばよい。 First, the action term extraction unit 121A reads from the segmentation unit 106 the entire document, in which the document data 107 of one paper has been divided into the first to fifth segments 132₁ to 132₅ (step S201). It then checks whether any segment 132 to be processed remains unprocessed, that is, whether any segment data 108 remains unprocessed (step S202). In the current state, in which the action term extraction unit 121A has not yet processed any of the first to fifth segments 132₁ to 132₅ (Y), the text constituting the first segment 132₁, which is the current target, is divided into sentences (step S203). To divide the text into sentences, a conventional method may be used that searches the text data serving as the segment data 108 of the first segment 132₁ for punctuation marks or periods and splits the text at those positions.
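Step S203 can be sketched in Python as follows. This is a simplified illustration that treats the Japanese full stop 「。」 and the ASCII period as sentence terminators, which is an assumption made here; a period inside a number such as "0.47" would also be split, so a production splitter would need additional rules.

```python
import re

# Each match is a run of non-terminator characters plus an optional terminator.
_SENTENCE = re.compile(r"[^。.]+[。.]?")

def split_sentences(segment_text):
    """Split the text of one segment into sentences at 。 or . (step S203)."""
    return [s.strip() for s in _SENTENCE.findall(segment_text) if s.strip()]

sentences = split_sentences("要求が高まっている。本稿では手法を提案する。")
```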

第１のセグメント１３２₁を構成する文章から先頭の１つの文（のテキストデータ）を分割によって取り出すと、この１つの文（のテキストデータ）から行為用語を特定する（ステップＳ２０４）。行為用語の特定方法は後に説明する。本実施例では一文から抽出する行為用語を一単語とするが、二単語以上でも構わない。また抽出する行為用語は単語ではなく複合語や句でも構わない。行為用語を特定すると、これを第１のセグメント１３２₁について用意された第１の行為用語リスト１４１₁に記入する（ステップＳ２０５）。 When the first sentence (its text data) has been separated out of the text constituting the first segment 132₁, an action term is identified from this one sentence (step S204). The method of identifying the action term is described later. In this example, the action term extracted from one sentence is a single word, but it may be two or more words. The extracted action term may also be a compound word or a phrase rather than a word. Once the action term is identified, it is entered in the first action term list 141₁ prepared for the first segment 132₁ (step S205).

このようにして第１のセグメント１３２₁の第１の文についての行為用語を第１のセグメント１３２₁について用意された第１の行為用語リスト１４１₁に格納したら、行為用語抽出部１２１Ａは第１のセグメント１３２₁に残りの文が存在するかをチェックする（ステップＳ２０６）。残りの文がある場合には（Ｙ）、ステップＳ２０３に戻って、第１のセグメント１３２₁における残りの文章から第２の文を抽出（分割）する。そして、この第２の文を基にして行為用語を特定し（ステップＳ２０４）、第１のセグメント１３２₁について用意された第１の行為用語リスト１４１₁にこの第２の文の行為用語を追加して格納する（ステップＳ２０５）。 When the action term for the first sentence of the first segment 132₁ has been stored in this way in the first action term list 141₁ prepared for the first segment 132₁, the action term extraction unit 121A checks whether any sentences remain in the first segment 132₁ (step S206). If sentences remain (Y), the process returns to step S203, and the second sentence is extracted (split off) from the remaining text of the first segment 132₁. An action term is then identified from this second sentence (step S204) and added to the first action term list 141₁ prepared for the first segment 132₁ (step S205).

以下、同様にして、たとえば第１のセグメント１３２₁のセグメントデータ１０８に４つの文のテキストデータが存在していた場合には、これら４つの文のテキストデータそれぞれから行為用語が特定されて第１のセグメント１３２₁について用意された第１の行為用語リスト１４１₁にこれらが格納される（ステップＳ２０５）。この後、ステップＳ２０６に進むと、第１のセグメント１３２₁には文が残っていないことが判明する（Ｎ）。そこで、この場合には、処理がステップＳ２０２に進む。 Likewise, if, for example, the segment data 108 of the first segment 132₁ contains the text data of four sentences, an action term is identified from the text data of each of these four sentences and stored in the first action term list 141₁ prepared for the first segment 132₁ (step S205). When the process then proceeds to step S206, it is found that no sentences remain in the first segment 132₁ (N). In this case, the process proceeds to step S202.

ステップＳ２０２では、セグメントデータ１０８について未処理のセグメントが存在するかのチェックが行われる。本実施例ではセグメントデータ１０８が第１〜第５のセグメント１３２₁〜１３２₅を有している。今、第１のセグメント１３２₁の処理が終了したので、まだ第２のセグメント１３２₂以降のセグメントが残っている（Ｙ）。そこで、今度は第２のセグメント１３２₂について第１の文が分割される（ステップＳ２０３）。そして、この第１の文について行為用語が特定される（ステップＳ２０４）。この行為用語は、第２のセグメント１３２₂について用意された第２の行為用語リスト１４１₂に格納される（ステップＳ２０５）。 In step S202, it is checked whether any unprocessed segment exists in the segment data 108. In this example, the segment data 108 has the first to fifth segments 132₁ to 132₅. Since the processing of the first segment 132₁ has just finished, the second segment 132₂ and the subsequent segments still remain (Y). This time, therefore, the first sentence of the second segment 132₂ is split off (step S203), and an action term is identified for this first sentence (step S204). This action term is stored in the second action term list 141₂ prepared for the second segment 132₂ (step S205).

以下、同様にして第２のセグメント１３２₂についても、その中の全部の文について１つずつ行為用語が特定され（ステップＳ２０４）、第２のセグメント１３２₂について用意された第２の行為用語リスト１４１₂にこれらが格納される（ステップＳ２０５）。この後、ステップＳ２０６に進むと、第２のセグメント１３２₂には文が残っていないことが判明する（Ｎ）。そこで、この場合には、処理がステップＳ２０２に進む。 Likewise, for the second segment 132₂, an action term is identified for each of its sentences one by one (step S204), and these are stored in the second action term list 141₂ prepared for the second segment 132₂ (step S205). When the process then proceeds to step S206, it is found that no sentences remain in the second segment 132₂ (N). In this case, the process proceeds to step S202.

このようにして第５のセグメント１３２₅まで同様の処理が終了すると、第１〜第５のセグメント１３２₁〜１３２₅のそれぞれについて用意された第１〜第５の行為用語リスト１４１₁〜１４１₅には、該当するセグメントごとの行為用語がリストアップされることになる。この時点で、処理はステップＳ２０２に戻るが、この例ではセグメントデータ１０８の後に続く残りのセグメントが存在しない（Ｎ）。そこで、この時点で行為用語抽出部１２１Ａの処理が終了する（エンド）。 When the same processing has been completed in this way up to the fifth segment 132₅, the first to fifth action term lists 141₁ to 141₅ prepared for the first to fifth segments 132₁ to 132₅ each list the action terms of the corresponding segment. At this point the process returns to step S202, but in this example no further segments remain in the segment data 108 (N). The processing of the action term extraction unit 121A therefore ends here (END).
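The loop of steps S202 to S206 can be summarized in a short Python sketch. The sentence splitter and the action term identifier are passed in as functions; the toy versions used in the demonstration (split at periods, take the last word) are placeholders invented here and are not the method of this example. Skipping sentences for which no action term is found is also an assumption, since the text does not say whether "none" is recorded in the list.

```python
def extract_action_term_lists(segments, split_sentences, identify_action_term):
    """Return one action term list per segment (FIG. 10, steps S202 to S206)."""
    term_lists = []
    for segment in segments:                       # S202: any unprocessed segment left?
        terms = []
        for sentence in split_sentences(segment):  # S203 / S206: next sentence, if any
            term = identify_action_term(sentence)  # S204: identify the action term
            if term is not None:
                terms.append(term)                 # S205: enter it in the segment's list
        term_lists.append(terms)
    return term_lists

# Toy stand-ins (hypothetical) so the loop can be run on its own.
def toy_split(text):
    return [p.strip() for p in text.split(".") if p.strip()]

def toy_identify(sentence):
    return sentence.split()[-1].lower()

lists = extract_action_term_lists(
    ["Demand is rising. We propose.", "We evaluate."], toy_split, toy_identify
)
```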

図１１は、図１０のステップＳ２０４で示した行為用語を特定する処理の様子を表わしたものである。図１１に示す処理は、図９に示したメモリ１０２Ａに格納された制御プログラムをＣＰＵ１０１が実行することによって実現する。図８〜図１０と共に説明する。 FIG. 11 shows a process of specifying the action term shown in step S204 of FIG. The processing shown in FIG. 11 is realized by the CPU 101 executing the control program stored in the memory 102A shown in FIG. This will be described with reference to FIGS.

一つの文から行為用語を特定するために、行為用語抽出部１２１Ａは、まず対象の一つの文を形態素解析する（ステップＳ２２１）。形態素解析は既知の技術を用いればよい。たとえば、その一つの文が「テキスト文書から自動的に要約や主張箇所の文章を作成することへの要求が高まっている。」という記述だとする。 In order to identify an action term from one sentence, the action term extraction unit 121A first performs a morphological analysis on one target sentence (step S221). A known technique may be used for the morphological analysis. For example, suppose that one sentence is a statement that “the demand for automatically creating summaries and asserted sentences from text documents is increasing”.

図１２は、形態素解析の結果を示したものである。この形態素解析の結果を示した表で、表層語とは、対象文の中での活用済みの部分文字列であり、基本形とは表層語の活用基本形である。行為用語抽出部１２１Ａは、形態素解析結果の形態素数をカウンタｉに代入する（図１１ステップＳ２２２）。 FIG. 12 shows the result of morphological analysis. In the table showing the result of this morphological analysis, the surface word is a partial character string already used in the target sentence, and the basic form is a basic form of using the surface word. The action term extraction unit 121A substitutes the morpheme number of the morpheme analysis result for the counter i (step S222 in FIG. 11).

この図１２で特定しようとする行為用語は、処理の対象となる文における端的な意味を表わす用語である。この分の対象となる文書の言語が日本語の場合、行為用語は、文の最後尾に位置する特定の動詞または特定の名詞となる。このような行為用語を迅速に特定するため、最後尾の形態素から順に品詞をチェックする。形態素とは、これ以上に細かくすると意味がなくなってしまう最小の文字列である。このチェックのために、カウンタｉは最後尾の形態素を示す番号となっている。 The action term to be specified in FIG. 12 is a term representing a simple meaning in the sentence to be processed. When the language of the target document is Japanese, the action term is a specific verb or a specific noun located at the end of the sentence. In order to quickly identify such action terms, the part of speech is checked in order from the last morpheme. A morpheme is a minimum character string that is meaningless if it is made finer than this. For this check, the counter i is a number indicating the last morpheme.

図１１に戻って説明する。カウンタｉは１ずつ減少することになるが、まず、現在の値「ｉ」が値「０」よりも大きいかのチェックが行われる（ステップＳ２２３）。これは、その一つの文の先頭までの処理が終了したかを判別するためである。最初は現在の値「ｉ」が値「０」よりも大きい（Ｙ）。そこで行為用語抽出部１２１Ａは、このｉ番目の品詞が行為用語となる条件に合致しているかチェックする（ステップＳ２２４）。合致していれば（Ｙ）、そのｉ番目の品詞の基本形を行為用語と特定して（ステップＳ２２５）、一連の処理を終了する（エンド）。 The description now returns to FIG. 11. The counter i is decremented by one at a time, but first it is checked whether the current value “i” is greater than “0” (step S223). This determines whether the processing has reached the beginning of the sentence. Initially the current value “i” is greater than “0” (Y). The action term extraction unit 121A therefore checks whether the part of speech of the i-th morpheme satisfies the condition for an action term (step S224). If it does (Y), the base form of the i-th morpheme is identified as the action term (step S225), and the series of processes ends (END).

この最後尾の形態素が行為用語としての条件に合致していなかった場合には（ステップＳ２２４：Ｎ）、現在の値「ｉ」を「１」だけ減算して、その文の中で注目する形態素を先頭方向に１つだけ移動させる（ステップＳ２２６）。そして、ステップＳ２２３に戻って、現在の値「ｉ」が値「０」よりも大きいかチェックする。このようにして、現在の値「ｉ」が値「０」よりも大きい間は、ステップＳ２２４でｉ番目の品詞の基本形を行為用語と特定されるまで、同様の処理が繰り返されることになる。ステップＳ２２３で現在の値「ｉ」が値「０」以下となった場合には（Ｎ）、その一つの文には行為用語が見当たらなかったことになる（ステップＳ２２７）。そこで、この場合には該当する行為用語が「なし」ということで一連の処理が終了する（エンド）。 If the last morpheme does not satisfy the condition for an action term (step S224: N), the current value “i” is decremented by “1”, moving the morpheme under consideration one position toward the beginning of the sentence (step S226). The process then returns to step S223 and checks whether the current value “i” is greater than “0”. In this way, as long as the current value “i” is greater than “0”, the same processing is repeated until the i-th morpheme is found in step S224 to satisfy the condition and its base form is identified as the action term. If the current value “i” becomes “0” or less in step S223 (N), no action term was found in the sentence (step S227). In this case, the series of processes ends with the action term set to “none” (END).

ところで、ステップＳ２０４で行為用語と判断する条件は、現在チェックしている形態素（ｉ番目）の品詞が特定の動詞または特定の名詞であることである。本実施例では、特定の動詞の例として、「ある」や「する」等の特定の例外の自立動詞を除く自立動詞とし、特定の名詞の例として、一般名詞およびサ変接続名詞および形容動詞語幹名詞とする。例外とする自立動詞には、「ある」や「する」の他に、たとえば「いる」、「でる」、「なる」、「よる」、「みる」、「やる」、「できる」、「いう」、「行なう」、「言う」、「つく」がある。図１２に示す形態素解析結果で示すと、ステップＳ２２４で行為用語としての条件に合致するものは、番号「１」、「２」、「４」、「６」、「８」、「９」、「１１」、「１３」、「１８」および「２０」の各形態素となる。 The condition for judging a morpheme to be an action term in step S204 is that the part of speech of the morpheme currently being checked (the i-th) is a specific verb or a specific noun. In this example, the specific verbs are independent verbs other than certain excepted independent verbs such as “aru” (ある) and “suru” (する), and the specific nouns are general nouns, sahen-connection nouns, and adjectival-verb-stem nouns. Besides “aru” and “suru”, the excepted independent verbs include, for example, “iru” (いる), “deru” (でる), “naru” (なる), “yoru” (よる), “miru” (みる), “yaru” (やる), “dekiru” (できる), “iu” (いう), “okonau” (行なう), “iu” (言う), and “tsuku” (つく). In the morphological analysis result shown in FIG. 12, the morphemes that satisfy the action term condition in step S224 are those numbered “1”, “2”, “4”, “6”, “8”, “9”, “11”, “13”, “18”, and “20”.

ところで、図１２に示した処理を実行した場合、文ごとに行為用語を形態素の最後尾から順にチェックすることにしている。このため、この例で最初に見つかる条件に合致する用語は番号「２０」のものであり、この形態素の品詞は「自立動詞」である。 When the processing for FIG. 12 is executed, the morphemes of each sentence are checked for an action term in order, starting from the last morpheme. In this example, therefore, the first term found that satisfies the condition is number “20”, and the part of speech of this morpheme is “independent verb”.

ステップＳ２２５では、番号「２０」の表層語が「高まっ」となっているために、これが基本形の「高まる」に変更されて、行為用語は「高まる」となる。 In step S225, since the surface word of the number “20” is “increased”, this is changed to “increased” in the basic form, and the action term becomes “increased”.
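The backward scan of FIG. 11 can be sketched in Python as follows. FIG. 12 is not reproduced in this text, so the morpheme list below is a hand-written stand-in for the output of a morphological analyzer: the segmentation and the English part-of-speech tags are assumptions, chosen so that the matching positions agree with the numbers quoted above. The function and constant names are likewise illustrative.

```python
# Excepted independent verbs listed in the description.
EXCLUDED_VERBS = {"ある", "する", "いる", "でる", "なる", "よる", "みる",
                  "やる", "できる", "いう", "行なう", "言う", "つく"}
# Hypothetical tags for general, sahen-connection, and adjectival-verb-stem nouns.
NOUN_TAGS = {"noun-general", "noun-sahen", "noun-adjverb-stem"}

def is_action_term(base, pos):
    """Condition checked in step S224."""
    if pos == "verb-independent":
        return base not in EXCLUDED_VERBS
    return pos in NOUN_TAGS

def identify_action_term(morphemes):
    """Scan (surface, base, pos) tuples from the end of the sentence (S222 to S227)."""
    i = len(morphemes)                   # S222: counter starts at the last morpheme
    while i > 0:                         # S223
        _surface, base, pos = morphemes[i - 1]
        if is_action_term(base, pos):    # S224
            return base                  # S225: the base form is the action term
        i -= 1                           # S226
    return None                          # S227: no action term in this sentence

MORPHEMES = [
    ("テキスト", "テキスト", "noun-general"), ("文書", "文書", "noun-general"),
    ("から", "から", "particle"), ("自動的", "自動的", "noun-adjverb-stem"),
    ("に", "に", "particle"), ("要約", "要約", "noun-sahen"),
    ("や", "や", "particle"), ("主張", "主張", "noun-sahen"),
    ("箇所", "箇所", "noun-general"), ("の", "の", "particle"),
    ("文章", "文章", "noun-general"), ("を", "を", "particle"),
    ("作成", "作成", "noun-sahen"), ("する", "する", "verb-independent"),
    ("こと", "こと", "noun-dependent"), ("へ", "へ", "particle"),
    ("の", "の", "particle"), ("要求", "要求", "noun-sahen"),
    ("が", "が", "particle"), ("高まっ", "高まる", "verb-independent"),
    ("て", "て", "particle"), ("いる", "いる", "verb-dependent"),
    ("。", "。", "symbol"),
]

matching = [n for n, (_s, b, p) in enumerate(MORPHEMES, start=1) if is_action_term(b, p)]
term = identify_action_term(MORPHEMES)
```

With this list, the scan skips morphemes 23, 22, and 21 and stops at morpheme 20, returning its base form 「高まる」.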

以上のようにして行為用語抽出部１２１Ａによる行為用語の抽出が終了すると、行為ベクトル生成部１２３Ａに処理が移る。 When the action term extraction by the action term extraction unit 121A is completed as described above, the process moves to the action vector generation unit 123A.

図１３は、行為用語抽出部と行為ベクトル生成部の処理の様子を表わしたものである。ここでは、図９に示す数値Ｍが「３」であるとして、１編の文書データ１０７が第１〜第３のセグメント１３２₁〜１３２₃に分割されたものとして、説明を簡略化する。 FIG. 13 illustrates the processing of the action term extraction unit and the action vector generation unit. Here, to simplify the description, the numerical value M shown in FIG. 9 is assumed to be “3”, that is, the document data 107 of one paper is assumed to be divided into first to third segments 132₁ to 132₃.

既に説明したように、行為用語抽出部１２１Ａによって第１〜第３のセグメント１３２₁〜１３２₃から第１〜第３の行為用語リスト１４１₁〜１４１₃が抽出される。たとえば、第１のセグメント１３２₁には５文があり、行為用語抽出部１２１Ａによって第１の行為用語のリスト１４１₁が抽出される。次に行為ベクトル生成部１２３Ａは、第１の行為用語のリスト１４１₁から第１の行為ベクトル１４２₁というベクトルを生成する。 As already described, the action term extraction unit 121A extracts the first to third action term lists 141₁ to 141₃ from the first to third segments 132₁ to 132₃. For example, the first segment 132₁ contains five sentences, and the action term extraction unit 121A extracts the first action term list 141₁ from it. Next, the action vector generation unit 123A generates a vector, the first action vector 142₁, from the first action term list 141₁.

数学の世界でベクトルは、成分とその値の組の集合で表わされる。コンピュータによる処理では値が「０」の組は意味がない。このため、図１３に示すような値が「０」以外の組のみを持つハッシュテーブルで表わすことが一般的である。図１３の行為ベクトルでは、成分を「見出し語」とし、値を「頻度」としている。ハッシュテーブルは配列とは異なり、見出し語数に関わらず一定時間で見出し語の値（この場合は頻度）を参照することができるので、ベクトルを表現するのに適している。 In the mathematical world, a vector is represented by a set of components and their values. In the processing by the computer, a group with a value “0” is meaningless. For this reason, it is common to represent a hash table having only values other than “0” as shown in FIG. In the action vector of FIG. 13, the component is “headword” and the value is “frequency”. Unlike an array, the hash table is suitable for expressing a vector because the value of the entry word (frequency in this case) can be referred to in a fixed time regardless of the number of entry words.

行為ベクトル生成部１２３Ａは、第１〜第３の行為用語リスト１４１₁〜１４１₃に含まれる用語を順にハッシュテーブルとしての第１〜第３の行為ベクトル１４２₁〜１４２₃に登録する。未登録の用語（見出し語）の場合は頻度を「１」とし、既に登録済みの用語の場合は頻度を「１」だけ加算する。このようにして第１〜第３の行為用語リスト１４１₁〜１４１₃のそれぞれについて第１〜第３の行為ベクトル１４２₁〜１４２₃を生成する。 The action vector generation unit 123A registers the terms contained in the first to third action term lists 141₁ to 141₃, in order, in the first to third action vectors 142₁ to 142₃, which are hash tables. For a term (headword) not yet registered, the frequency is set to “1”; for a term already registered, the frequency is incremented by “1”. In this way, the first to third action vectors 142₁ to 142₃ are generated for the first to third action term lists 141₁ to 141₃, respectively.

第１〜第３の行為ベクトル１４２₁〜１４２₃を生成したら、これら第１〜第３の行為ベクトル１４２₁〜１４２₃の和を求めて、文書全体の統合ベクトル１４３を生成する。このような処理過程で文書全体の統合ベクトル１４３を生成する代わりに、処理の最初から空の統合ベクトル１４３を用意しておき、第１〜第３の行為ベクトル１４２₁〜１４２₃のそれぞれを作成する際に第１〜第３の行為ベクトル１４２₁〜１４２₃の「頻度」をそのまま合算するようにしてもよい。あるいは、第１〜第３のセグメント１３２₁〜１３２₃の頻度を「１」と解釈して統合ベクトル１４３を作成してもよい。 Once the first to third action vectors 142₁ to 142₃ have been generated, their sum is calculated to generate the integrated vector 143 for the entire document. Instead of generating the integrated vector 143 at this stage, an empty integrated vector 143 may be prepared at the start of processing, and the “frequencies” of the first to third action vectors 142₁ to 142₃ may be added to it directly as each of those vectors is created. Alternatively, the integrated vector 143 may be created by treating the frequency in each of the first to third segments 132₁ to 132₃ as “1”.
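The hash-table representation and the summation can be sketched in Python, where a `dict` plays the role of the hash table. The function names and the English sample terms are assumptions. The last function reflects one reading of the "frequency treated as 1" alternative, namely that each term counts once per segment in which it occurs; the text does not spell this out.

```python
from collections import Counter

def build_action_vector(term_list):
    """Register terms in a hash table: new term -> 1, known term -> +1."""
    vec = {}
    for term in term_list:
        vec[term] = vec.get(term, 0) + 1
    return vec

def sum_vectors(vectors):
    """Integrated vector as the term-wise sum of the segment vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)          # adds the frequencies term by term
    return dict(total)

def sum_as_presence(vectors):
    """Variant (assumed reading): each term counts once per segment containing it."""
    total = Counter()
    for vec in vectors:
        total.update(vec.keys())   # adds 1 for every term present in the segment
    return dict(total)

# Hypothetical action term lists for three segments.
term_lists = [["propose", "increase", "propose"], ["evaluate", "propose"], ["increase"]]
vectors = [build_action_vector(terms) for terms in term_lists]
integrated = sum_vectors(vectors)
presence = sum_as_presence(vectors)
```

Because only terms that actually occur are stored, lookups take roughly constant time regardless of vocabulary size, which matches the reason given above for using a hash table.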

行為ベクトル生成部１２３Ａは、以上のようにして第１〜第３の行為ベクトル１４２₁〜１４２₃とこれらの統合ベクトル１４３を作成したら、これらを図９に示した行為ベクトル記憶部１２６に記憶させる。行為ベクトル生成部１２３Ａから行為ベクトル記憶部１２６への第１〜第３の行為ベクトル１４２₁〜１４２₃の記憶は、統合ベクトル１４３を作成した段階で統合ベクトル１４３と共に一括して行ってもよいし、第１〜第３の行為ベクトル１４２₁〜１４２₃の記憶を個々に行い、統合ベクトル１４３の記憶をその後に行うようにしてもよい。 Having created the first to third action vectors 142₁ to 142₃ and their integrated vector 143 as described above, the action vector generation unit 123A stores them in the action vector storage unit 126 shown in FIG. 9. The first to third action vectors 142₁ to 142₃ may be stored from the action vector generation unit 123A into the action vector storage unit 126 all at once, together with the integrated vector 143, when the integrated vector 143 has been created; alternatively, the first to third action vectors 142₁ to 142₃ may be stored individually and the integrated vector 143 stored afterwards.

図９に示した核セグメント判定部１２８は、行為ベクトル記憶部１２６に記憶されている第１〜第Ｍのセグメント１３２₁〜１３２_Mの第１〜第Ｍの行為ベクトル１４２₁〜１４２_Mと統合ベクトル１４３を順に比較して類似度を計算する。そして、最も類似度の高いセグメントを核セグメント１１１とする。 The core segment determination unit 128 shown in FIG. 9 compares, in order, each of the first to Mth action vectors 142₁ to 142_M of the first to Mth segments 132₁ to 132_M stored in the action vector storage unit 126 with the integrated vector 143 and calculates their similarity. The segment with the highest similarity is taken as the core segment 111.

The similarity between the first to M-th action vectors 142₁ to 142_M and the integrated vector 143 is calculated using the cosine measure. The cosine measure is a common index of the similarity between vectors and is expressed by the following formula (1).
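Formula (1) itself is not reproduced in this text. For reference, the cosine measure is conventionally defined as below; the component symbols s_{i,t} and v_t (the frequencies of action term t in the i-th action vector S_i and in the integrated vector V) are notation assumed here for illustration.

```latex
\cos(S_i, V) = \frac{S_i \cdot V}{\lVert S_i \rVert \, \lVert V \rVert}
             = \frac{\sum_{t} s_{i,t}\, v_{t}}
                    {\sqrt{\sum_{t} s_{i,t}^{2}}\;\sqrt{\sum_{t} v_{t}^{2}}}
```

Since all frequencies are non-negative, the value lies between 0 and 1, and a larger value means the segment's mix of action terms is closer to that of the whole document.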

FIG. 14 shows an example in which the similarity between each segment's action vector and the integrated vector is calculated using the cosine measure. To simplify the description, FIG. 14 assumes that the value M shown in FIG. 9 is "4", that is, that one piece of document data is divided into four segments.

The integrated vector 143 is the direct sum of the "frequency" values in the first to fourth action vectors 142₁ to 142₄. FIG. 14 also shows the similarity cos(Si, V), where "Si" denotes the i-th action vector 142_i and "V" denotes the integrated vector 143. The similarity cos(S1, V) between the first action vector 142₁ and the integrated vector 143 is "0.47", and the similarity cos(S2, V) between the second action vector 142₂ and the integrated vector 143 is "0.67". The similarity cos(S3, V) between the third action vector 142₃ and the integrated vector 143 is "0.56", and the similarity cos(S4, V) between the fourth action vector 142₄ and the integrated vector 143 is "0.62".

Thus, in the example shown in FIG. 14, the similarity cos(S2, V) between the second action vector 142₂ and the integrated vector 143 is the highest, at "0.67". The core segment determination unit 128 shown in FIG. 9 therefore determines that the second segment 132₂, to which the second action vector 142₂ belongs, is the core segment 111A.
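The determination described above (compute the cosine similarity of every action vector with the integrated vector, then take the maximum) might look like the sketch below. The function names and the four toy vectors are assumptions for illustration; they are not the data of FIG. 14.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def core_segment(vectors):
    """Return (0-based index of the most similar segment, all similarities)."""
    integrated = {}
    for vec in vectors:
        for term, freq in vec.items():
            integrated[term] = integrated.get(term, 0) + freq
    scores = [cosine(vec, integrated) for vec in vectors]
    best = max(range(len(vectors)), key=lambda i: scores[i])
    return best, scores

# Hypothetical action vectors for a document split into four segments.
vectors = [
    {"describe": 1},
    {"propose": 2, "evaluate": 1},
    {"propose": 1, "conclude": 1},
    {"describe": 1, "evaluate": 1},
]
best, scores = core_segment(vectors)  # best is a 0-based segment index
```

With this toy data the second segment (index 1) is selected, because its action terms are distributed most like those of the document as a whole.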

When the core segment determination unit 128 has identified the core segment 111A of the document, the output unit 112 outputs the core segment of that document. The core segment may be output as a segment number or as the segment itself.

In the first example, the core segment determination unit 128 determines the similarity between the first to fourth action vectors 142₁ to 142₄ and the integrated vector 143 and takes the segment 132 with the highest value as the core segment 111A, but the invention is not limited to this. For example, the segment that has the highest similarity and whose similarity is also at or above a preset threshold may be taken as the core segment 111A. Alternatively, among the segments whose similarity is at or above a predetermined threshold, the segment located furthest forward (or furthest back) in the document may be taken as the core segment 111A.
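The selection variants just listed can be expressed as one small function. This is only a sketch: the name `select_core`, its parameters, and the threshold value 0.5 used in the usage lines are assumptions, while the similarity values are the ones quoted for FIG. 14.

```python
def select_core(scores, threshold=None, prefer=None):
    """Pick the 0-based index of the core segment, or None if none qualifies.

    threshold: if given, only segments scoring at or above it are candidates.
    prefer:    None    -> highest score among the candidates
               "first" -> candidate located furthest forward in the document
               "last"  -> candidate located furthest back in the document
    """
    candidates = [i for i, s in enumerate(scores)
                  if threshold is None or s >= threshold]
    if not candidates:
        return None
    if prefer == "first":
        return candidates[0]
    if prefer == "last":
        return candidates[-1]
    return max(candidates, key=lambda i: scores[i])

scores = [0.47, 0.67, 0.56, 0.62]  # similarities quoted for FIG. 14

highest = select_core(scores)                                  # plain maximum
earliest = select_core(scores, threshold=0.5, prefer="first")  # first above 0.5
latest = select_core(scores, threshold=0.5, prefer="last")     # last above 0.5
nothing = select_core(scores, threshold=0.7)                   # no segment qualifies
```

Returning `None` when no segment reaches the threshold leaves it to the caller to decide what happens next, for example falling back to combined segments.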

In the first example, the document language is Japanese and the action term is a specific verb or a specific noun located at the end of the sentence, but the invention is not limited to this. In English, for example, the first verb that appears in a sentence can be used as the action term. Consider the English sentence "This paper proposes a novel approach to accurately searching Web pages for relevant information in problem solving." In this case, the first verb "proposes" is extracted, and its base form "propose" can be used as the action term.

As described above, in the first example each sentence is checked in order from its end or from its beginning, depending on the language, so action terms can be extracted easily by exploiting sentence structure.

FIG. 15 shows the configuration of an information processing apparatus using the text processing system of the second example of the present invention. In the text processing system 110B of the information processing apparatus 100B of this example shown in FIG. 15, parts identical to those in FIG. 9 are given the same reference numerals, and their description is omitted as appropriate.

The text processing system 110B of the second example uses the same action term extraction unit 121A and action vector generation unit 123A as the first example; only the configuration of the core segment determination unit 128B differs. The memory 102B of the control unit 103B stores a control program corresponding to the second example. The following description therefore focuses on the configuration and operation of the core segment determination unit 128B.

The core segment determination unit 128B of the second example selects high-frequency action terms from the integrated vector stored in the action vector storage unit 126. It then calculates the appearance probability of those action terms in the action vector of each segment and takes the segment with the highest probability as the core segment.

The following formula (2) expresses the appearance probability. Formula (2) is, however, only one example of a formula for obtaining the appearance probability, and the invention is not limited to it.

FIG. 16 shows an example in which the similarity between each segment's action vector and the integrated vector is calculated in the second example. To simplify the description, FIG. 16 assumes that the value M shown in FIG. 15 is "4", that is, that one piece of document data is divided into four segments.

The integrated vector 143 is the direct sum of the "frequency" values in the first to fourth action vectors 142₁ to 142₄. FIG. 16 also shows the appearance probability p(Si, V), where "Si" denotes the i-th action vector 142_i and "V" denotes the integrated vector 143. The appearance probability p(S1, V) for the first action vector 142₁ and the integrated vector 143 is "0.2", and the appearance probability p(S2, V) for the second action vector 142₂ and the integrated vector 143 is "0.6". The appearance probability p(S3, V) for the third action vector 142₃ and the integrated vector 143 is "0.5", and the appearance probability p(S4, V) for the fourth action vector 142₄ and the integrated vector 143 is "0.5".

Thus, in the example shown in FIG. 16, the appearance probability p(S2, V) for the second action vector 142₂ and the integrated vector 143 is the highest, at "0.6". The core segment determination unit 128B shown in FIG. 15 therefore determines that the second segment 132₂, to which the second action vector 142₂ belongs, is the core segment 111B.
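The second example can be sketched as follows. Formula (2) is not reproduced in this text, so the probability used here (the share of a segment's action-term occurrences taken by the most frequent terms of the integrated vector) is an assumption, as are the function name, the `top_k` parameter, and the toy frequencies. The frequencies were chosen so that the results match the probabilities quoted for FIG. 16; they are not the figure's data.

```python
from collections import Counter

def appearance_probabilities(vectors, top_k=1):
    """Return (frequent terms, per-segment probability of those terms).

    ASSUMED definition: occurrences of the frequent terms in a segment
    divided by all action-term occurrences in that segment.
    """
    integrated = Counter()
    for vec in vectors:
        integrated.update(vec)
    frequent = [term for term, _ in integrated.most_common(top_k)]
    probs = []
    for vec in vectors:
        total = sum(vec.values())
        hits = sum(vec.get(term, 0) for term in frequent)
        probs.append(hits / total if total else 0.0)
    return frequent, probs

# Hypothetical action vectors for four segments (illustrative data only).
vectors = [
    {"describe": 4, "propose": 1},
    {"propose": 3, "evaluate": 2},
    {"propose": 1, "conclude": 1},
    {"describe": 1, "propose": 1},
]
frequent, probs = appearance_probabilities(vectors)
core = max(range(len(probs)), key=lambda i: probs[i])  # 0-based index
```

Here "propose" is the most frequent action term of the document, and the second segment (index 1) uses it in the largest share of its sentences.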

When the core segment determination unit 128B has identified the core segment 111B of the document, the output unit 112 outputs the core segment of that document. The core segment may be output as a segment number or as the segment itself.

In the second example, the core segment determination unit 128B determines the appearance probability for the first to fourth action vectors 142₁ to 142₄ and the integrated vector 143 and takes the segment 132 with the highest value as the core segment 111B, but the invention is not limited to this. For example, the segment that has the highest appearance probability and whose probability is also at or above a preset threshold may be taken as the core segment 111B. Alternatively, among the segments whose appearance probability is at or above a predetermined threshold, the segment located furthest forward (or furthest back) in the document may be taken as the core segment 111B.

<Second Embodiment of the Invention>

Next, a second embodiment of the present invention will be described.

FIG. 17 shows the configuration of an information processing apparatus using a text processing system according to the second embodiment of the present invention. In FIG. 17, parts identical to those in FIG. 7 of the first embodiment are given the same reference numerals, and their description is omitted as appropriate.

The information processing apparatus 100C of the second embodiment has a control unit 103C that includes a CPU 101 and a memory 102C storing a control program in at least part of it. The control unit 103C controls each unit constituting the text processing system 110C.

The text processing system 110C consists of an action term extraction unit 121, an action vector generation unit 123, an action vector storage unit 126, and a core segment decision unit 301. That is, compared with the information processing apparatus 100 of the first embodiment, the information processing apparatus 100C of the second embodiment has the core segment decision unit 301 in place of the core segment determination unit 128 (FIG. 7). The core segment decision unit 301 compares the integrated vector 124 stored in the action vector storage unit 126 with the action vector 125 of each segment and decides on the core segment, that is, the optimal segment. The core segment 111C output from the core segment decision unit 301 is supplied to the output unit 112.

Thus, in the text processing system 110C of the second embodiment, two units operate in the same way as in the text processing system 110 of the first embodiment: the action term extraction unit 121, which extracts as an action term the plain meaning expressed by each sentence of each segment of the document, and the action vector generation unit 123, which generates a vector of the action terms extracted for each segment and an integrated vector that sums the vectors of all segments. The text processing system 110C of the second embodiment is therefore described with a focus on the core segment decision unit 301.

Like the core segment determination unit 128 of the first embodiment shown in FIG. 7, the core segment decision unit 301 first compares, in order, the integrated vector 124 stored in the action vector storage unit 126 with the action vector 125 of each segment to determine the optimal segment. As with the core segment determination unit 128 in the first example of the first embodiment, this comparison may calculate the "similarity" by comparing the first to M-th action vectors 142₁ to 142_M with the integrated vector 143 in order. Alternatively, as with the core segment determination unit 128B in the second example of the first embodiment, the "frequency" values in the first to M-th action vectors 142₁ to 142_M may be summed as they are.

Suppose that the optimal segment could not be determined by such a comparison of the integrated vector 124 with the action vector 125 of each segment. In this case, the core segment decision unit 301 of the second embodiment obtains sum vectors of the action vectors of a plurality of adjacent segments. It then determines the optimal segment from comparison data 129C obtained by comparing these sum vectors with the integrated vector in order.

FIG. 18 shows an example of how the core segment decision unit obtains the sum vectors of the action vectors of a plurality of adjacent segments. It is described together with FIG. 17. To simplify the description, FIG. 18 assumes that one piece of document data is divided into four segments.

The core segment decision unit 301 first calculates the sum vector 311₁₊₂ of the first and second action vectors, an action vector that combines the first action vector 142₁ and the second action vector 142₂ (step S401). Next, it calculates the sum vector 311₂₊₃ of the second and third action vectors, an action vector that combines the second action vector 142₂ and the third action vector 142₃ (step S402). It then calculates the sum vector 311₃₊₄ of the third and fourth action vectors, an action vector that combines the third action vector 142₃ and the fourth action vector 142₄ (step S403).

Suppose that the similarities between the integrated vector 124 (see FIG. 16) and the sum vector 311₁₊₂ of the first and second action vectors, the sum vector 311₂₊₃ of the second and third action vectors, and the sum vector 311₃₊₄ of the third and fourth action vectors calculated in this way are "0.76", "0.87", and "0.76", respectively. In this example, the core segment decision unit 301 decides on the combination of the second segment 132₂ and the third segment 132₃, which corresponds to the sum vector 311₂₊₃ of the second and third action vectors, as the core segment 111C and supplies it to the output unit 112.
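Steps S401 to S403 and the subsequent comparison can be sketched as a sliding window over adjacent segments. The function names, the `width` parameter, and the toy vectors are assumptions for illustration, and the use of the cosine measure follows the first example rather than anything stated for FIG. 18.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_adjacent_window(vectors, width=2):
    """Sum each run of `width` adjacent action vectors and compare the sum
    with the integrated vector; return (0-based start index, similarity)."""
    integrated = Counter()
    for vec in vectors:
        integrated.update(vec)
    best_start, best_score = None, -1.0
    for start in range(len(vectors) - width + 1):
        window = Counter()
        for vec in vectors[start:start + width]:
            window.update(vec)               # sum vector of adjacent segments
        score = cosine(window, integrated)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score

# Hypothetical action vectors for four segments (illustrative data only).
vectors = [
    {"report": 1},
    {"propose": 2},
    {"propose": 1, "evaluate": 2},
    {"conclude": 1},
]
start, score = best_adjacent_window(vectors)  # start is 0-based
```

With this toy data the window starting at index 1 wins, that is, the second and third segments combined.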

As described above, the second embodiment of the present invention has the effect that high-quality partial text that can serve as a summary of a document can be extracted from a wide variety of documents. The partial text is of high quality as prose because a segment, or a run of adjacent segments, is selected: within a segment or adjacent segments the sentences connect naturally and address a single topic, so the text holds together. A wide variety of documents can be handled because no subheadings are used. Instead, the method selects one or more adjacent segments that use action terms similar to those used in the document as a whole, or one or more adjacent segments that make heavy use of the action terms that are frequent in the document as a whole, and therefore requires no dictionary or rules.

FIG. 19 shows, as a third example, a specific configuration of an information processing apparatus using the text processing system according to the second embodiment of the present invention. In the text processing system 110D of the information processing apparatus 100D of the third example shown in FIG. 19, parts identical to those in FIG. 9 or FIG. 17 are given the same reference numerals, and their description is omitted as appropriate. To simplify the description, FIG. 19 also assumes that one piece of document data is divided into four segments.

The text processing system 110D of the third example uses the same action term extraction unit 121A and action vector generation unit 123A as the first example shown in FIG. 9. It also uses the core segment decision unit 301 shown in FIG. 17. The memory 102D of the control unit 103D stores a control program corresponding to the third example configured in this way. The following description therefore focuses on the operation of the core segment decision unit 301.

The core segment decision unit 301 compares, in order, the first to fourth action vectors 142₁ to 142₄ stored in the action vector storage unit 126 with the integrated vector 143 and calculates their similarity. A threshold is set on the calculated similarity: a segment is decided on as the core segment only if its similarity reaches the threshold. In the third example, this threshold is "0.7".

The similarity between the first to fourth action vectors 142₁ to 142₄ and the integrated vector 143 is calculated using the cosine measure, as in the first example. The similarity between each segment's action vector and the integrated vector calculated with the cosine measure is as shown in FIG. 14, as in the first example. That is, the similarity cos(S1, V) between the first action vector 142₁ and the integrated vector 143 is "0.47", and the similarity cos(S2, V) between the second action vector 142₂ and the integrated vector 143 is "0.67". The similarity cos(S3, V) between the third action vector 142₃ and the integrated vector 143 is "0.56", and the similarity cos(S4, V) between the fourth action vector 142₄ and the integrated vector 143 is "0.62".

In the first example, the similarity cos(S2, V) between the second action vector 142₂ and the integrated vector 143 was the highest, at "0.67", and the core segment determination unit 128 shown in FIG. 9 determined that the second segment 132₂, to which the second action vector 142₂ belongs, was the core segment 111A. In the third example, the similarity of every action vector is below the threshold of "0.7". The core segment decision unit 301 therefore combines adjacent segments and generates, as new action vectors, the sum vector 311₁₊₂ of the first and second action vectors, the sum vector 311₂₊₃ of the second and third action vectors, and the sum vector 311₃₊₄ of the third and fourth action vectors. It then recalculates the similarity between each of these sum vectors 311₁₊₂, 311₂₊₃, and 311₃₊₄ and the integrated vector 124.

The recalculation of these similarities proceeds as in the example shown in FIG. 18 for the second embodiment. As a result of the recalculation, the similarity between the sum vector 311₂₊₃ of the second and third action vectors and the integrated vector 124 is "0.87". The calculated similarity of "0.87" is at or above the threshold of "0.7". The core segment decision unit 301 therefore decides on the second segment 132₂ and the third segment 132₃, which correspond to the sum vector 311₂₊₃ of the second and third action vectors, as the core segment 111D in the form of a combined segment.

In the third example described above, the combined segment with the highest similarity is taken as the core segment, but the invention is not limited to this. For example, among the combined segments whose similarity is at or above the threshold, the combined segment located furthest forward (or furthest back) in the document may be taken as the core segment.

The third example describes the case in which each pair of neighboring segments is combined to generate a combined segment, but the invention is not limited to this. The core segment 111D may remain undecided even after the similarities between the two-segment combined segments and the integrated vector 124 have been compared; in such a case, the number of segments combined may be increased step by step until a combined segment with a similarity at or above the threshold appears. Alternatively, a predetermined number of segments or more may be combined from the outset, without going through the procedure of increasing the number of combined segments step by step.
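The widening procedure described above might be sketched like this. The function names, the `start_width` parameter, the toy vectors, and the threshold of 0.9 are assumptions for illustration; the cosine measure is carried over from the first example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def grow_until_threshold(vectors, threshold, start_width=1):
    """Widen the run of adjacent segments until some combined segment
    reaches the threshold; return (0-based start, width, similarity) or None."""
    integrated = Counter()
    for vec in vectors:
        integrated.update(vec)
    for width in range(start_width, len(vectors) + 1):
        best = None
        for start in range(len(vectors) - width + 1):
            window = Counter()
            for vec in vectors[start:start + width]:
                window.update(vec)
            score = cosine(window, integrated)
            if best is None or score > best[1]:
                best = (start, score)
        if best[1] >= threshold:
            return best[0], width, best[1]
    return None

# Hypothetical action vectors for four segments (illustrative data only).
vectors = [
    {"report": 1},
    {"propose": 2},
    {"propose": 1, "evaluate": 2},
    {"conclude": 1},
]
found = grow_until_threshold(vectors, threshold=0.9)
skipped = grow_until_threshold(vectors, threshold=0.9, start_width=2)
```

Passing `start_width=2` or more corresponds to combining a predetermined number of segments from the outset instead of starting from single segments.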

As described in detail above, the present invention makes it possible to extract, with high accuracy, partial text that serves as the summary or main claim of a wide variety of documents. Using this partial text for indexes and snippets in information retrieval enables highly accurate searches, so improved work efficiency can be expected. When conducting an information survey, it is no longer necessary to read every part of the documents obtained, and improved work efficiency can be expected from the faster survey. The present invention can also be used to review the main points of a document that has been written and to proofread it, which likewise improves work efficiency.

FIG. 1 is a claim correspondence diagram of the text processing system of the present invention.
FIG. 2 is a claim correspondence diagram of the information processing apparatus of the present invention.
FIG. 3 is a claim correspondence diagram of the text processing method of the present invention.
FIG. 4 is a claim correspondence diagram of the document processing method of the present invention.
FIG. 5 is a claim correspondence diagram of the text processing program of the present invention.
FIG. 6 is a claim correspondence diagram of the information processing program of the present invention.
FIG. 7 is a block diagram showing the configuration of an information processing apparatus using a text processing system according to an embodiment of the present invention.
FIG. 8 is a plan view showing a document before division by the segmentation unit and the segmented document in this embodiment.
FIG. 9 is a block diagram showing the configuration of an information processing apparatus using the text processing system of the first example of the present invention.
FIG. 10 is a flowchart showing the processing of the action term extraction unit in the text processing system of the first example.
FIG. 11 is a flowchart showing the processing for identifying an action term shown in step S204 of FIG. 10.
FIG. 12 is an explanatory diagram showing the result of morphological analysis in the first example.
FIG. 13 is an explanatory diagram showing the processing of the action term extraction unit and the action vector generation unit in the first example.
FIG. 14 is an explanatory diagram showing an example in which the similarity between each segment's action vector and the integrated vector is calculated using the cosine measure in the first example.
FIG. 15 is a block diagram showing the configuration of an information processing apparatus using the text processing system of the second example of the present invention.
FIG. 16 is an explanatory diagram showing an example in which the similarity between each segment's action vector and the integrated vector is calculated in the second example.
FIG. 17 is a block diagram showing the configuration of an information processing apparatus using a text processing system according to the second embodiment of the present invention.
FIG. 18 is an explanatory diagram showing an example of how the core segment decision unit obtains the sum vectors of the action vectors of a plurality of adjacent segments in this embodiment.
FIG. 19 is a block diagram showing, as a third example, a specific configuration of an information processing apparatus using the text processing system according to the second embodiment of the present invention.

Explanation of symbols

10, 22, 110, 110A, 110B, 110C, 110D Text processing system
11 Action term extraction means
12 Action term comparison means
13 Segment discrimination means
20, 100, 100A, 100B, 100C, 100D Information processing apparatus
21 Segment division means
30 Text processing method
31, 42 Action term extraction step
32, 43 Action term comparison step
33, 44 Segment discrimination step
40 Information processing method
41 Segment division step
50 Text processing program
51, 62 Action term extraction processing
52, 63 Action term comparison processing
53, 64 Segment discrimination processing
60 Information processing program
61 Segment division processing
101 CPU
102, 102A, 102B, 102C, 102D Memory
103, 103A, 103B, 103C, 103D Control unit
105 Document collection unit
107 Document data
108 Segment data
111, 111A, 111C, 111D Core segment
112 Output unit
121, 121A Action term extraction unit
123, 123A Action vector generation unit
125, 142 Action vector
126 Action vector storage unit
128 Core segment determination unit
141 Action term list
143 Integrated vector
301 Core segment decision unit
311 Sum vector

Claims

A text processing system comprising:
action term extraction means for extracting, for each segment, from segment data obtained by dividing text data constituting one document into segments each being a range of a group of sentences, text information that matches any of a set of action terms, an action term being a term that plainly expresses the intent of a sentence;
action term comparison means for comparing segment-specific action terms, being the action terms extracted for each segment by the action term extraction means, with integrated action terms, being the action terms integrated over all segments of the text data constituting the one document; and
segment discrimination means for discriminating, from the comparison result of the action term comparison means, the segment from which the segment-specific action terms most similar to the integrated action terms were extracted as the segment that is the main part of the text data of the one document.

2. The text processing system according to claim 1, wherein, when a plurality of action term candidates are extracted from one sentence to be processed, the action term extraction means takes the candidate located at the end of the sentence as the segment-specific action term.

3. The text processing system according to claim 2, wherein the action term extraction means includes morphological analysis means for performing morphological analysis on one sentence to be processed, and, among the morphemes obtained by the morphological analysis means, a specific verb or a specific noun located at the end of the one sentence is taken as the segment-specific action term.

4. The text processing system according to claim 3, wherein the specific verb is an independent verb other than predetermined specific independent verbs whose action is ambiguous, and the specific noun is a general noun, a sa-hen connection noun (a noun that forms a verb with "suru"), or an adjectival-verb stem noun (a noun that serves as the stem of an adjectival verb).
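The extraction rule in the preceding claims can be sketched as a backward scan over an already analyzed sentence. This is only an illustrative sketch: the part-of-speech tag names, the stop list `AMBIGUOUS_VERBS`, and the function name `extract_action_term` are hypothetical and are not taken from the source, and the morphological analysis itself is assumed to have been done beforehand.

```python
from typing import List, Optional, Tuple

# (base form, part-of-speech tag). The tag names are hypothetical placeholders
# for the categories named in the claims, not the output of a real analyzer.
Token = Tuple[str, str]

NOUN_TAGS = {"noun-general", "noun-sahen", "noun-adjectival-stem"}
VERB_TAG = "verb-independent"

# Hypothetical stop list of independent verbs whose action is ambiguous.
AMBIGUOUS_VERBS = {"する", "なる", "ある"}


def extract_action_term(tokens: List[Token]) -> Optional[str]:
    """Return the action term candidate nearest the end of the sentence.

    Scans the morphemes from the end of the sentence toward the start and
    returns the first qualifying verb or noun, or None if there is none.
    """
    for base, tag in reversed(tokens):
        if tag == VERB_TAG and base not in AMBIGUOUS_VERBS:
            return base
        if tag in NOUN_TAGS:
            return base
    return None


# Example: "新製品を発表しました" with hand-assigned tags.
sentence = [
    ("新製品", "noun-general"),
    ("を", "particle"),
    ("発表", "noun-sahen"),
    ("する", "verb-independent"),
    ("ます", "auxiliary"),
    ("た", "auxiliary"),
]
action_term = extract_action_term(sentence)
```

Scanning from the end means that, when a sentence contains several candidates, the one closest to the end is chosen; in the example the ambiguous verb "する" is skipped and the sa-hen noun "発表" is selected.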

5. The text processing system according to claim 1, wherein the content that directly expresses the intention of the sentences in each segment is represented by an action vector, which is a word vector, and the content of the entire text data constituting the one document is represented by the sum of the action vectors of the segments.

6. The text processing system according to claim 5, further comprising means for normalizing the frequency information of the action vector of each segment and then creating, as the sum of these action vectors, an integrated vector that represents the concise content of the entire text data constituting the one document.

7. The text processing system according to claim 6, further comprising core segment determination means for determining a core segment, that is, the segment forming the core of the document, by calculating the similarity between the action vector of each segment and the integrated vector for the text data constituting the one document.

8. The text processing system according to claim 7, wherein the core segment determination means uses the cosine measure to calculate the similarity.
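The vector-based comparison described in the preceding claims can be illustrated with a small sketch. The claims do not state how the frequency information is normalized, so dividing each count by the segment's total count is an assumption here, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def action_vector(terms: List[str]) -> Vector:
    """Build a frequency-normalized action vector for one segment.

    Assumption: normalization divides each count by the segment's total.
    """
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def integrated_vector(vectors: List[Vector]) -> Vector:
    """Sum the per-segment action vectors into one document-level vector."""
    total: Vector = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return total


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def core_segment(segment_terms: List[List[str]]) -> int:
    """Return the index of the segment most similar to the integrated vector."""
    vectors = [action_vector(terms) for terms in segment_terms]
    doc = integrated_vector(vectors)
    sims = [cosine(vec, doc) for vec in vectors]
    return max(range(len(sims)), key=sims.__getitem__)


# Hypothetical action terms extracted from three segments.
segments = [["announce", "announce", "sell"], ["sell"], ["announce"]]
best = core_segment(segments)
```

In this toy input the first segment contains both action terms in roughly the same proportion as the whole document, so it has the highest cosine similarity to the integrated vector.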

9. The text processing system according to claim 7, wherein the core segment determination means determines the core segment by calculating the appearance frequency, in the word vector of each segment, of the terms of the word vector of the entire text document.

10. The text processing system according to claim 9, wherein the core segment determination means uses an appearance probability in calculating the appearance frequency.

11. The text processing system according to claim 7, wherein the core segment determination means determines the segment having the maximum similarity or appearance frequency to be the core segment.

12. The text processing system according to claim 7, wherein the core segment determination means determines a segment whose similarity or appearance frequency is equal to or higher than a preset threshold to be the core segment.

13. The text processing system according to claim 12, wherein, among the segments that are equal to or greater than the threshold, the segment that is the largest is determined to be the core segment.

14. The text processing system wherein, among the segments that are equal to or greater than the threshold, the core segment is determined on the basis of the segment's appearance position in the entire text data constituting the one document.
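The threshold-based selection in the preceding claims can be sketched as a single function with a selectable policy. The policy names and the function name are hypothetical. In particular, the text above does not make clear which appearance position is preferred, so the `"first"` policy (earliest qualifying segment) is an assumption, as is reading "largest" as the highest score.

```python
from typing import List


def select_core_segments(
    scores: List[float], threshold: float, policy: str = "all"
) -> List[int]:
    """Pick core segment indices from per-segment scores.

    scores    -- similarity (or appearance frequency) of each segment
    threshold -- preset threshold; only segments at or above it qualify
    policy    -- "all":   every qualifying segment
                 "max":   the qualifying segment with the highest score
                          (assumed reading of "largest")
                 "first": the earliest qualifying segment (assumption)
    """
    candidates = [i for i, s in enumerate(scores) if s >= threshold]
    if not candidates or policy == "all":
        return candidates
    if policy == "max":
        return [max(candidates, key=lambda i: scores[i])]
    if policy == "first":
        return [candidates[0]]
    raise ValueError(f"unknown policy: {policy}")


scores = [0.7, 0.9, 0.4]
all_hits = select_core_segments(scores, 0.6)
top_hit = select_core_segments(scores, 0.6, policy="max")
first_hit = select_core_segments(scores, 0.6, policy="first")
```

An empty result signals that no segment reached the threshold, which is the case the sum vector claim addresses.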

15. The text processing system according to claim 12, further comprising: sum vector calculation means for calculating, when there is no segment whose similarity or appearance frequency is equal to or higher than the preset threshold, a sum vector for the segment at each position of interest in the entire text data constituting the one document, the sum vector being an action vector obtained by combining one or more adjacent segments; and sum vector comparison means for determining whether any of the sum vectors calculated by the sum vector calculation means has a similarity or appearance frequency equal to or greater than a predetermined threshold, wherein, when the sum vector comparison means finds sum vectors equal to or greater than the threshold, the segments corresponding to any one of them are determined to be the core segments.
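The sum vector fallback can be sketched as a sliding window over adjacent segments. Several details are assumptions made only for illustration: the window is a fixed number of consecutive segments starting at each position, cosine similarity is used as the score, and the first window that reaches the threshold is returned. The function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]


def add_vectors(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two sparse vectors."""
    out = dict(a)
    for term, weight in b.items():
        out[term] = out.get(term, 0.0) + weight
    return out


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fallback_core_segments(
    vectors: List[Vector], doc: Vector, threshold: float, window: int = 2
) -> List[int]:
    """Find adjacent segments whose combined vector reaches the threshold.

    For each start position, the action vectors of `window` consecutive
    segments are summed and compared with the integrated vector `doc`.
    Returns the indices of the first qualifying window, or [] if none.
    """
    for start in range(len(vectors) - window + 1):
        total: Vector = {}
        for vec in vectors[start:start + window]:
            total = add_vectors(total, vec)
        if cosine(total, doc) >= threshold:
            return list(range(start, start + window))
    return []


# No single segment reaches 0.9, but segments 0 and 1 together do.
seg_vectors = [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}]
doc_vector = {"a": 1.0, "b": 1.0}
merged = fallback_core_segments(seg_vectors, doc_vector, threshold=0.9)
```

Here each of the first two segments alone has a cosine similarity of about 0.71 with the document vector, while their sum matches it almost exactly, so the pair is reported as the core segments.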

16. An information processing apparatus comprising:
segment dividing means for dividing text data constituting one document into segments, each segment being a group of sentences; and the text processing system according to any one of claims 1 to 15.

17. The information processing apparatus according to claim 16, wherein at least a part of the text processing system and the segment dividing means are connected by a communication network.

18. A text processing method comprising:
an action term extraction step of extracting, for each segment, text information that is selected from segment data obtained by dividing text data constituting one document into segments, each segment being a group of sentences, and that matches any of the action terms, an action term being a term that directly expresses the intention of a sentence;
an action term comparison step of comparing the segment-specific action terms, which are the action terms extracted for each segment in the action term extraction step, with the integrated action terms, which are the action terms integrated over all segments of the text data constituting the one document; and
a segment discrimination step of discriminating, from the comparison result of the action term comparison step, the segment from which the segment-specific action terms most similar to the integrated action terms were extracted as the segment that is the main part of the text data of the one document.

19. An information processing method comprising:
a segment division step of dividing text data constituting one document into segments, each segment being a group of sentences;
an action term extraction step of extracting, for each segment, text information that is selected according to a predetermined rule from the segment data divided into segments in the segment division step and that matches any of the action terms, an action term being a term that directly expresses the intention of a sentence;
an action term comparison step of comparing the segment-specific action terms, which are the action terms extracted for each segment in the action term extraction step, with the integrated action terms, which are the action terms integrated over all segments of the text data constituting the one document; and
a segment discrimination step of discriminating, from the comparison result of the action term comparison step, the segment from which the segment-specific action terms most similar to the integrated action terms were extracted as the segment that is the main part of the text data of the one document.

20. A text processing program causing a computer to execute:
action term extraction processing for extracting, for each segment, text information that is selected from segment data obtained by dividing text data constituting one document into segments, each segment being a group of sentences, and that matches any of the action terms, an action term being a term that directly expresses the intention of a sentence;
action term comparison processing for comparing the segment-specific action terms, which are the action terms extracted for each segment in the action term extraction processing, with the integrated action terms, which are the action terms integrated over all segments of the text data constituting the one document; and
segment discrimination processing for discriminating, from the comparison result of the action term comparison processing, the segment from which the segment-specific action terms most similar to the integrated action terms were extracted as the segment that is the main part of the text data of the one document.

21. An information processing program causing a computer to execute:
segment division processing for dividing text data constituting one document into segments, each segment being a group of sentences;
action term extraction processing for extracting, for each segment, text information that is selected according to a predetermined rule from the segment data divided into segments by the segment division processing and that matches any of the action terms, an action term being a term that directly expresses the intention of a sentence;
action term comparison processing for comparing the segment-specific action terms, which are the action terms extracted for each segment in the action term extraction processing, with the integrated action terms, which are the action terms integrated over all segments of the text data constituting the one document; and
segment discrimination processing for discriminating, from the comparison result of the action term comparison processing, the segment from which the segment-specific action terms most similar to the integrated action terms were extracted as the segment that is the main part of the text data of the one document.