JP3017098B2

JP3017098B2 - Natural language processing apparatus and method

Info

Publication number: JP3017098B2
Application number: JP8221024A
Authority: JP
Inventors: 暉将江原; 則好浦谷; 淵培金
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1996-08-22
Filing date: 1996-08-22
Publication date: 2000-03-06
Anticipated expiration: 2016-08-22
Also published as: JPH1063658A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a natural language processing apparatus and method, and more particularly to a natural language processing apparatus and method for automatically creating a summary or abstract of a document using a computer.

[0002]

[Prior Art] Conventionally, there have been the following three types, (a), (b), and (c), of natural language processing apparatus that use a computer to automatically produce a summary (a brief statement of the main points of the content) or an abstract (a set of excerpts) of a document. Hereinafter, summaries and abstracts are collectively referred to as summaries.

[0003] (a) The first type of apparatus uses a computer to understand the meaning of the input text and generates summary text from the result. Because machine understanding of meaning is a complex process, this method is limited to short sentences. For a document containing long sentences, the apparatus either fails to understand the meaning or misunderstands it, so that no summary can be produced or, if one is produced, its summarization accuracy is poor.

[0004] (b) The second type of apparatus uses the connection relationships between sentences and has the advantage that it can summarize with simpler processing than apparatus (a). However, because it relies on sentence-to-sentence relationships, when a document contains long sentences and has only a few sentences overall, the units cut out as summary sentences are too coarse, and summarization accuracy again deteriorates.

[0005] (c) The third type of apparatus extracts the words that play an important role in the document (keywords) and takes out the portions containing many keywords as the summary. In a long sentence, however, the keywords may be concentrated in only part of the sentence. When this type of apparatus is applied to long sentences and such a sentence is cut out as a summary sentence, important and unimportant portions are mixed together, and summarization accuracy deteriorates.

[0006] Here, summarization accuracy means the quality of a summary as judged subjectively.

[0007]

[Problem to Be Solved by the Invention] As described in (a), (b), and (c) above, the conventional natural language processing apparatuses share a common problem: summarization accuracy deteriorates when they summarize or abstract a document containing long sentences.

[0008] The present invention has been made in view of the above, and its object is to provide a natural language processing apparatus and method capable of producing accurate summaries even for documents containing long sentences.

[0009]

[Means for Solving the Problem] To achieve the above object, the apparatus of the present invention according to claim 1 is a natural language processing apparatus that automatically generates, from an input document, a summary document according to its content, comprising: dividing means that obtains an information element sequence from the input document, determines division points by matching the sequence against predetermined division patterns, and divides the input document into a plurality of short sentences; and summary generating means that generates a summary sentence by adopting a predetermined number of the divided short sentences on the basis of a predetermined parameter, and that outputs the summary document in a form complete as sentences by adjusting the wording of the short sentences before and after each division point according to whether the short sentence adjacent to an adopted short sentence has also been adopted.

[0010]

[0011]

[0012] In the apparatus according to claim 2, the summary generating means adjusts the wording by using phrase modification rules to change, add, or delete words at the connection points of the summary sentence, and outputs the summary document.

[0013] In the apparatus according to claim 3, both the input document and the summary sentence are used to adjust the wording at the connection points, and at least one of the cohesion, tense, and mood information of the sentence contained in the divided portion of the input document is included in the summary document.

[0014] To achieve the above object, the method of the present invention according to claim 4 is a natural language processing method that automatically generates, from an input document, a summary document according to its content, comprising: a dividing step of obtaining an information element sequence from the input document, determining division points by matching the sequence against predetermined division patterns, and dividing the input document into a plurality of short sentences; and a summary generating step of generating a summary sentence by adopting a predetermined number of the divided short sentences on the basis of a predetermined parameter, and outputting the summary document in a form complete as sentences by adjusting the wording of the short sentences before and after each division point according to whether the short sentence adjacent to an adopted short sentence has also been adopted.

[0015]

[0016]

[0017] In the method according to claim 5, in the summary generating step, the wording is adjusted by using phrase modification rules to change, add, or delete words at the connection points of the summary sentence, and the summary document is output.

[0018] In the method according to claim 6, the wording at the connection points is adjusted using both the input document and the summary sentence, and at least one of the cohesion, tense, and mood information of the sentence contained in the divided portion of the input document is included in the summary document.

[0019]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0020] (First Embodiment) FIG. 1 is a flowchart showing the basic processing procedure of a first embodiment of a natural language processing apparatus to which the present invention is applied. It shows the process of creating a summary document Sab from an input document Sin using the apparatus of the present invention.

[0021] The apparatus of the present invention can be implemented on a general-purpose computer comprising an interface, a control and arithmetic unit, a storage device, and the like. The storage device holds, together with a word dictionary and the like, the division pattern file described later and the programs of FIGS. 1 to 4. Operating under these programs, the apparatus can be represented functionally by the following blocks: a document input unit 10, an automatic short-sentence division unit 12, an automatic summarization unit 14, a connection point (division point) adjustment unit 16, and a summary document output unit 18. The document input unit 10 inputs and stores the input document Sin to be summarized, 「特別査察をＸＸ国が拒否している以上、ＩＡＥＡが……ことになった」 ("Since country XX is refusing the special inspection, the IAEA ... it has been decided ..."), and consists of, for example, a keyboard, a scanner, or a voice input device together with RAM and the like.

[0022] Before summarization, the automatic short-sentence division unit 12 automatically divides the input document Sin, which contains a long sentence, into a plurality of short sentences and outputs the resulting document Ssh. The automatic summarization unit 14 outputs a summary sentence Ssab that summarizes the divided document Ssh. The connection point (division point) adjustment unit 16 adjusts the connection points of the summary sentence Ssab, which correspond to the division points of the input document Sin. In addition to the summary sentence Ssab output by the automatic summarization unit 14, the adjustment unit 16 also receives the original text before summarization (the input document Sin) output by the document input unit 10, and uses information from both to adjust the connection points of the summary sentence Ssab. The automatic short-sentence division unit 12, the automatic summarization unit 14, and the connection point (division point) adjustment unit 16 can be realized by a control and arithmetic unit such as a CPU reading out a program stored in a program ROM and performing control and computation using RAM as a work area.
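The data flow among units 12, 14, and 16 described above can be sketched as a chain of three functions. This is only an illustrative sketch: the function name `run_pipeline` and the three callables passed to it are hypothetical placeholders, not names from the patent. The point it shows is that the adjustment stage receives the original input Sin as well as the selection made by the summarizer.

```python
from typing import Callable, List


def run_pipeline(
    sin: str,
    divide: Callable[[str], List[str]],
    select: Callable[[List[str]], List[int]],
    adjust: Callable[[str, List[str], List[int]], str],
) -> str:
    """Chain the three processing units; all three callables are placeholders."""
    ssh = divide(sin)                  # unit 12: input document -> short sentences (Ssh)
    adopted = select(ssh)              # unit 14: indices of the short sentences kept (Ssab)
    return adjust(sin, ssh, adopted)   # unit 16: uses both Sin and the selection -> Sab
```

Passing the stages in as arguments keeps the sketch neutral about how each unit is implemented, which matches the statement later in the text that any existing summarization technique may be used in unit 14.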

[0023] In FIG. 1, the automatic short-sentence division unit 12 outputs a document Ssh in which the input document Sin is divided into three short sentences: S1 「特別査察を……している。」 ("... is refusing the special inspection."), S2 「ＩＡＥＡが得ている……できない。」 ("... that the IAEA has obtained ... cannot."), and S3 「核査察協定違反……ことになった。」 ("... violation of the nuclear inspection agreement ... it has been decided."). The automatic summarization unit 14 summarizes the input document Sin, now consisting of the short sentences S1, S2, and S3, adopts the two short sentences S1 「特別査察を……している。」 and S2 「ＩＡＥＡが得ている……ことはできない。」, and automatically outputs the summary sentence Ssab consisting of S1 and S2.

[0024] The connection point (division point) adjustment unit 16 makes its adjustment by deleting the period (。) at the connection point of the summary sentence Ssab consisting of S1 and S2 and inserting the phrase 「以上、」 ("given that"), and outputs the summary document Sab. The summary document output unit 18 outputs the summary document Sab whose connection points (division points) have been adjusted, and consists of, for example, a display device, a printer, or a voice output device.

[0025] FIG. 2 is a flowchart showing the detailed processing procedure of the automatic short-sentence division unit 12.

[0026] In this example, morphological analysis and syntactic analysis are used for the short-sentence division processing. The input document Sin 「特別査察をＸＸ国が拒否している以上、ＩＡＥＡ……ことになった」 is first subjected to morphological analysis in step S20: words are identified, and each word is given its expression (surface) form, standard form, part of speech, and semantic classification number. Morphological analysis is the process of dividing a Japanese document into words and assigning each word its attributes. The technique is well known and is described in detail in Reference 1 (“新版情報処理ハンドブック１４編「自然言語処理」” [New Edition Information Processing Handbook, Part 14, "Natural Language Processing"], edited by the Information Processing Society of Japan, Ohmsha, 1995), so a detailed description is omitted here.

[0027] Briefly, the main purpose of morphological analysis is to identify the boundaries between morphemes, which are regarded as the smallest meaning-bearing units among the elements of a Japanese sentence written without spaces, and to identify the part of speech of each morpheme. There are several approaches to morphological analysis; the example here follows the school grammar approach. For example, morpheme boundaries are identified as shown at S4 in the figure, and 「特別」 is assigned the part of speech [adv], 「査察」 the part of speech [noun], 「を」 the part of speech case particle [case], and 「拒否し」 the part of speech [verb].
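As a rough illustration of what step S20 produces, the following Python sketch represents each morpheme with the four attributes named above (expression form, standard form, part of speech, semantic classification number). The `Morpheme` class and its field names are hypothetical. The part-of-speech tags come from the example in the text; the semantic classification numbers are not given there, so they are left unset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Morpheme:
    surface: str                     # expression (surface) form as it appears in the text
    standard: str                    # standard (dictionary) form
    pos: str                         # part-of-speech tag
    sem_class: Optional[int] = None  # semantic classification number (not given in the text)


# The four morphemes singled out in the example; the other morphemes of the sentence are omitted.
tokens = [
    Morpheme("特別", "特別", "adv"),
    Morpheme("査察", "査察", "noun"),
    Morpheme("を", "を", "case"),
    Morpheme("拒否し", "拒否する", "verb"),
]


def pos_sequence(morphemes: List[Morpheme]) -> List[str]:
    """Return the part-of-speech tags in order, as later pattern matching would need."""
    return [m.pos for m in morphemes]
```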

[0028] Next, in step S22, intra-bunsetsu syntactic analysis is performed to determine the internal structure of each bunsetsu (phrase unit). Intra-bunsetsu syntactic analysis is the process of determining bunsetsu boundaries in the morpheme sequence and assigning each bunsetsu its attributes. The technique is well known and is described in detail, as the extraction of information element sequences, in Reference 2 by the present applicants (“日英機械翻訳のための日本語長文自動短文分割と主語の補完” [Automatic division of long Japanese sentences into short sentences and subject completion for Japanese-English machine translation], 金淵培 and 江原暉将, Transactions of the Information Processing Society of Japan, Vol. 35, No. 6, pp. 1018-1028, June 1994), so a detailed description is omitted here.

[0029] Briefly, intra-bunsetsu syntactic analysis derives, from the morpheme information obtained by morphological analysis, four kinds of information element sequences (surface element sequence, standard element sequence, symbol element sequence, and short-sentence element sequence), that is, sequences corresponding to the input document Sin to which attribute information and connection information have been added. Bunsetsu boundaries are then identified by matching the information element sequences corresponding to the input document Sin against an intra-bunsetsu grammar, so that the bunsetsu boundaries b1, b2, b3, b4, and so on are determined as shown at S5 in the figure and the connection information of the information element sequences is obtained.

[0030] Next, in step S24, simplified inter-bunsetsu syntactic analysis is performed, which joins a plurality of bunsetsu and determines the extent of each clause. Bunsetsu joining (simplified inter-bunsetsu syntactic analysis) is the process of deciding, using the connection information obtained by morphological analysis and intra-bunsetsu syntactic analysis, whether adjacent bunsetsu should be joined or kept apart. It is a well-known technique described in detail as grouping in Reference 2, so a detailed description is omitted here.

[0031] Briefly, simplified inter-bunsetsu syntactic analysis forms groups while advancing from the beginning of the sentence to the end over the symbol elements obtained by intra-bunsetsu syntactic analysis. The bunsetsu up to each group that ends in a continuative bunsetsu are joined to determine the extent of a clause, and a division point candidate is set at each such continuative bunsetsu. In FIG. 2, as shown at S6, division into the three clauses 「特別査察をＸＸ国が拒否している以上、」, 「ＩＡＥＡが得ているデータからは核物質の利用目的をこれ以上検証することはできず、」, and 「核査察協定違反の決議を採択することになった。」 is shown as the candidate. The symbol "/" indicates the position of a division point.
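The grouping just described can be sketched as a single left-to-right pass that closes a clause whenever it reaches a continuative bunsetsu. This is a simplified illustration, not the grouping procedure of Reference 2: the function name, the bunsetsu segmentation, and the hand-assigned `is_continuative` flags below are all assumptions made for the example.

```python
def group_into_clauses(bunsetsu):
    """bunsetsu: list of (text, is_continuative) pairs in sentence order.

    Closes a clause after every continuative bunsetsu and at the end of the sentence.
    Each clause boundary is a division point candidate.
    """
    clauses, current = [], []
    for text, is_continuative in bunsetsu:
        current.append(text)
        if is_continuative:
            clauses.append("".join(current))
            current = []
    if current:
        clauses.append("".join(current))
    return clauses


# Hand-segmented example sentence; the flags mark the bunsetsu assumed to be continuative.
BUNSETSU = [
    ("特別査察を", False), ("ＸＸ国が", False), ("拒否している以上、", True),
    ("ＩＡＥＡが", False), ("得ている", False), ("データからは", False),
    ("核物質の", False), ("利用目的を", False), ("これ以上", False),
    ("検証することはできず、", True),
    ("核査察協定違反の", False), ("決議を", False), ("採択することになった。", False),
]

clauses = group_into_clauses(BUNSETSU)
print(" / ".join(clauses))  # "/" marks the division point candidates, as at S6
```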

[0032] Next, in step S26, division point determination is performed on the information element sequence of the input document Sin. This processing matches the information element sequence of the input document Sin, from the beginning of the sentence toward the end, against a plurality of division patterns stored in advance in the division pattern file Fdv, and finds the points at which the sentence should be divided. A division pattern expresses the information element sequence pattern of a point that should be divided, or of a point that should not be divided, in the same format as the information element sequence of the input document Sin.

[0033] For example, if the pattern "continuative form of a verb + conjunctive particle + comma" is in the division pattern file Fdv as a point that should be divided, and the same pattern occurs in part of the information element sequence of the input document Sin, the position after the comma is determined to be a division point. Conversely, if the pattern 「あわせる＋て＋読点」 (the verb あわせる + the particle て + comma) is in the division pattern file Fdv as a point that should not be divided, and the same pattern occurs in part of the information element sequence of the input document Sin, no division is made after that comma. This division point determination technique is well known and is described in detail as pattern matching in Reference 2, so a detailed description is omitted here.
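The two kinds of pattern in this paragraph can be illustrated with a small matcher. This is a minimal sketch under stated assumptions: the element representation (a pair of standard form and category), the category names, and the rule that a "do not divide" pattern overrides a "divide" pattern at the same position are all choices made for the example, not details taken from the division pattern file Fdv.

```python
# Each element of the sequence is (standard_form, category); a pattern symbol
# matches an element if it equals either its standard form or its category.
SPLIT_PATTERNS = [("verb-continuative", "conj-particle", "comma")]
NO_SPLIT_PATTERNS = [("あわせる", "て", "comma")]


def _matches(elements, end, pattern):
    """True if `pattern` matches the elements ending at index `end`."""
    start = end - len(pattern) + 1
    if start < 0:
        return False
    window = elements[start:end + 1]
    return all(sym in elem for sym, elem in zip(pattern, window))


def find_split_points(elements):
    """Scan from sentence start to end; return indices after which to divide."""
    points = []
    for end in range(len(elements)):
        if any(_matches(elements, end, p) for p in NO_SPLIT_PATTERNS):
            continue  # assumed precedence: an explicit "do not divide" pattern wins
        if any(_matches(elements, end, p) for p in SPLIT_PATTERNS):
            points.append(end)
    return points


divisible = [("拒否する", "verb-continuative"), ("て", "conj-particle"), ("、", "comma")]
protected = [("あわせる", "verb-continuative"), ("て", "conj-particle"), ("、", "comma")]
print(find_split_points(divisible), find_split_points(protected))
```

Both example sequences have the same category pattern; only the second is blocked, because its standard forms match the "do not divide" entry.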

[0034] This embodiment has been described using morphological analysis and syntactic analysis for automatic short-sentence division. In addition to these, surface-string analysis, semantic analysis, and context analysis may be used for the division processing, either all together or in any combination. These analysis techniques are well known and are described in detail in Reference 1 and Reference 3 (“日本語文章推敲支援ツール「推敲」における字面解析手法とその評価” [Surface-string analysis method in the Japanese text revision support tool "推敲" and its evaluation], 菅沼明 et al., Information Processing Society of Japan, Natural Language Processing study group materials, 68-8, 1988), so a detailed description is omitted here.

[0035] The input document Sin is then divided in step S28 according to the result of the division point determination in step S26, and is output as the document Ssh divided into the short sentences S1, S2, and S3.

[0036] When the document Ssh is output, the sentence-final predicate of each clause obtained in step S24 is converted to its terminal form, so that each clause is provisionally given the form of a sentence. In addition, when a division point contains an adversative conjunctive particle, processing such as inserting an adversative conjunction into the sentence following the division point is performed, so that as much of the connection information as possible is retained in the document Ssh. The processing performed when the document Ssh is output is thus limited to processing concerning sentence cohesion, and as a result the document Ssh shown in the figure is output: 「特別査察をＸＸ国が拒否している。ＩＡＥＡが得ているデータからは核物質の利用目的をこれ以上検証することはできない。核査察協定違反の決議を採択することになった。」 ("Country XX is refusing the special inspection. The purpose for which the nuclear material is used cannot be verified any further from the data the IAEA has obtained. It has been decided to adopt a resolution on violation of the nuclear inspection agreement.")
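The two output-time operations described here (putting the clause-final predicate into terminal form, and carrying an adversative relation over as a conjunction) can be sketched with a small suffix table. This is only an approximation: a real implementation would conjugate using the morphological information, whereas the sketch rewrites string endings. The table `ENDING_RULES`, the function name, and the adversative example are hypothetical.

```python
# (clause-final string, terminal-form replacement, conjunction for the next sentence)
ENDING_RULES = [
    ("している以上、", "している。", ""),
    ("できず、", "できない。", ""),
    ("したが、", "した。", "しかし、"),  # adversative particle: keep the relation as a conjunction
]


def finalize(clauses):
    """Turn each clause into a stand-alone sentence, keeping cohesion where possible."""
    sentences, carry = [], ""
    for clause in clauses:
        text = carry + clause   # prepend a conjunction left over from the previous division point
        carry = ""
        for ending, terminal, conjunction in ENDING_RULES:
            if text.endswith(ending):
                text = text[: -len(ending)] + terminal
                carry = conjunction
                break
        sentences.append(text)
    return sentences


ssh = finalize([
    "特別査察をＸＸ国が拒否している以上、",
    "ＩＡＥＡが得ているデータからは核物質の利用目的をこれ以上検証することはできず、",
    "核査察協定違反の決議を採択することになった。",
])
print("".join(ssh))
```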

[0037] FIG. 3 is a flowchart showing the detailed processing procedure of the automatic summarization unit 14.

[0038] The automatic summarization unit 14 may use any existing summarization technique; FIG. 3 shows an example implemented with conventional technique (c).

[0039] First, in step S32, keywords are extracted as parameters from the divided document Ssh. In step S34, a predetermined number of short sentences are selected and adopted, starting from the short sentence in the document Ssh that contains the most keywords, and in step S36 the summary sentence Ssab is output. The keywords may be taken from a keyword list set in advance, or they may be determined dynamically each time, for example from the frequency of use of words in the document Ssh. Here, the former approach is used: 「ＸＸ国」 and 「ＩＡＥＡ」 are extracted as keywords, so that the short sentences S1 and S2, which are suitable for the summary, are selected, while the remaining short sentence S3 「核査察協定違反の決議を採択することになった。」, which contains no keyword, is not adopted in the summary sentence Ssab.
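Steps S32 to S36 with a preset keyword list can be sketched as follows. The function name is hypothetical, and two details are assumptions the text does not settle: keyword occurrences are counted by substring matching, and ties are broken in favour of the earlier sentence.

```python
def select_sentences(sentences, keywords, n):
    """Pick the n sentences containing the most keywords; return their indices in document order."""
    scored = [(sum(s.count(k) for k in keywords), i) for i, s in enumerate(sentences)]
    # Sort by descending keyword count; ties keep the earlier sentence first (assumption).
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:n]
    return sorted(i for _, i in top)


SSH = [
    "特別査察をＸＸ国が拒否している。",
    "ＩＡＥＡが得ているデータからは核物質の利用目的をこれ以上検証することはできない。",
    "核査察協定違反の決議を採択することになった。",
]
adopted = select_sentences(SSH, ["ＸＸ国", "ＩＡＥＡ"], 2)
print(adopted)
```

Returning indices rather than sentence text lets the later adjustment stage check whether two adopted short sentences were adjacent in the input document.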

[0040] When a long sentence is automatically divided into short sentences as described above, the division points lie in the middle of a sentence of the input document Sin, so the divided text as it stands does not satisfy the form of a sentence. The wording at the division points therefore needs to be adjusted when the summary is created.

[0041] FIG. 4 is a flowchart showing the detailed processing procedure of the connection point (division point) adjustment unit 16, in which the wording at the division points is adjusted.

[0042] Wording adjustment means, for example, the following. When the short sentence formed from the portion before a division point is adopted as a summary sentence, the conjunctive particle attached to the division point is deleted from that short sentence and the predicate is put into its terminal form, giving a complete sentence. When the short sentence formed from the portion after a division point is adopted as a summary sentence, a conjunction corresponding in meaning to the deleted conjunctive particle is added to that short sentence, so that the meaning of the conjunctive particle in the original text is retained in the divided short sentence.

[0043] Referring to FIG. 4, the input document Sin is first input in step S40, and the summary sentence Ssab output by the automatic summarization unit 14 is input in step S41. In the following step S42, the connection point (division point) adjustment unit 16 determines whether the two sentences on either side of the connection point to be adjusted are consecutive in the input document Sin. In other words, it determines whether the short sentences S1 and S2 adopted in the summary sentence Ssab are two short sentences that were consecutive in the input document Sin, and the subsequent processing branches according to the result.

[0044] When the following short sentence has also been adopted, as S2 has for the short sentence S1, the adjustment of step S44 using the divided portion is performed. That is, for a division point between adjacent short sentences S1 and S2 that were consecutive in the input document Sin and have both been adopted in the summary sentence Ssab, the wording at their connection point (the division point, in terms of the input document Sin), 「している以上、」, is left unchanged from the input document Sin, and the connection is restored to the form it had in the input document Sin.

[0045] On the other hand, when the following short sentence has not been adopted, as S3 has not for the short sentence S2, the wording is adjusted in step S46 using the sentence-final portion. That is, the tense and mood information at the end of the sentence in the input document Sin, 「することになった。」, is added to the short sentence S2, and the result is output as the final summary document Sab. Such processing is made possible by using the information in the input document Sin to adjust the wording at the division points. In this way, the cohesion, tense, and mood information of the sentences in the input document Sin can be preserved accurately in the summary document Sab. Here, cohesion means the meaning carried by words that express the relationship between sentences, such as conjunctive particles; tense is the notion of time contained in a sentence, such as past, present, or future; and mood is the subjective meaning of the writer contained in a sentence, such as a wish or a question.
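The branch between steps S44 and S46 can be sketched as below. This is a simplified illustration under stated assumptions: the function name and parameters are hypothetical, and since the text does not specify how the sentence-final tense and mood expression is attached morphologically, the sketch simply replaces the final period of the short sentence with a caller-supplied ending (here 「ことになった。」, a shortened stand-in for the 「することになった。」 mentioned above).

```python
def adjust_connections(original_clauses, short_sentences, adopted, final_ending):
    """Build the summary text from the adopted short sentences.

    original_clauses[i]: clause i as it appeared in the input sentence, with its connective.
    short_sentences[i]:  the same clause after conversion into a stand-alone sentence.
    adopted:             sorted indices chosen by the summarizer.
    final_ending:        tense/mood expression taken from the end of the input sentence.
    """
    last = len(original_clauses) - 1
    parts = []
    for i in adopted:
        if i + 1 in adopted:
            # Step S44: the next clause was also adopted, so restore the original connection.
            parts.append(original_clauses[i])
        elif i < last:
            # Step S46: the following clause was dropped; carry over the original sentence ending.
            parts.append(short_sentences[i].rstrip("。") + final_ending)
        else:
            # The final clause already carries the original ending.
            parts.append(short_sentences[i])
    return "".join(parts)


ORIGINAL = ["特別査察をＸＸ国が拒否している以上、", "これ以上検証することはできず、", "決議を採択することになった。"]
SHORT = ["特別査察をＸＸ国が拒否している。", "これ以上検証することはできない。", "決議を採択することになった。"]
print(adjust_connections(ORIGINAL, SHORT, [0, 1], "ことになった。"))
```

When every clause is adopted, the function reproduces the original sentence unchanged, which is consistent with the idea that adjacent adopted short sentences revert to their original connected form.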

[0046] When steps S44 and S46 are finished, processing proceeds to step S48, and the summary document Sab with adjusted connection points is output.

[0047] As described above, the connection point (division point) adjustment unit 16 changes, adds, or deletes words at the connection points of the summary sentence Ssab using phrase modification rules that take into account the cohesion, tense, and mood information described above, and outputs the summary document Sab as a combination of short sentences each complete in form as a sentence.

[0048] Thus, according to this embodiment, the automatic short-sentence division processing of FIG. 2 is applied to the input document Sin before the original text is summarized, so that summarizing a long sentence is replaced by summarizing a combination of short sentences, which improves summarization accuracy. Further, adjusting the expressions at the beginning and end of the short sentences makes it possible to preserve their form as sentences. Further still, by using information about the division points in the original text before division to adjust the wording at the beginning and end of the divided short sentences, the cohesion, tense, and mood information of the original sentences can be included in the summary document. The cohesion, tense, and mood information of the input document Sin can therefore be preserved in the summary document Sab with high accuracy.

【００４９】[0049]

【発明の効果】以上のとおり、請求項１または４の発明
では、入力文書から情報素列を求めて所定の分割パター
ンと照合することで分割点を決定して入力文書を複数の
短文に分割してから分割された複数の短文に基づいて要
約文書を生成する際に、分割された複数の短文から所定
のパラメータに基づき所定数の短文を採用した要約文を
生成し、採用された短文と連続する他の短文が採用され
ているか否かに応じて分割点の前後の短文に語句の調整
を行うことで要約文書を文として完成した形式で出力し
ているので、文としての形態を保った適切な要約文書が
得られ、要約精度を向上させることができる。As described above, in the invention of claim 1 or 4, division points are determined by obtaining an information element sequence from the input document and collating it with predetermined division patterns, the input document is divided into a plurality of short sentences, and a summary document is generated based on those short sentences. In doing so, a summary sentence adopting a predetermined number of short sentences is generated from the divided short sentences based on predetermined parameters, and the phrases of the short sentences before and after each division point are adjusted according to whether another short sentence contiguous with an adopted short sentence has also been adopted, so that the summary document is output in a completed sentence form. An appropriate summary document that keeps its form as sentences is thus obtained, and summarization accuracy can be improved.
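The division step can be pictured with the following sketch, which is an assumption-laden illustration rather than the patented procedure. It treats an information element as a hypothetical `(surface, tag)` pair, treats a division pattern as a tuple of tags, and places a division point immediately after each match. The tag names, the English tokens, and the functions `find_division_points` and `split_at` are all invented for the example.

```python
def find_division_points(elements, patterns):
    """Return positions just after each place where a tag pattern matches.

    elements: list of (surface, tag) pairs (hypothetical information elements)
    patterns: list of non-empty tag tuples (hypothetical division patterns)
    """
    tags = [tag for _, tag in elements]
    points = []
    for i in range(len(tags)):
        for pat in patterns:
            if pat and tuple(tags[i:i + len(pat)]) == tuple(pat):
                points.append(i + len(pat))
                break  # one division point per start position is enough
    return points


def split_at(elements, points):
    """Split the element sequence at the given points and join the surfaces with spaces."""
    pieces, start = [], 0
    for p in sorted(set(points)):
        if start < p < len(elements):
            pieces.append(elements[start:p])
            start = p
    pieces.append(elements[start:])
    return [" ".join(surface for surface, _ in piece) for piece in pieces]


elements = [("typhoon", "NOUN"), ("approached", "VERB"), ("and", "CONJ"),
            ("trains", "NOUN"), ("stopped", "VERB")]
patterns = [("VERB", "CONJ")]
print(split_at(elements, find_division_points(elements, patterns)))
```

The first piece still ends in a connective, which is the kind of residue the connection point adjustment is meant to tidy up.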

【００５０】[0050]

【００５１】[0051]

【００５２】請求項２または５の発明では、語句変更規
則を利用して、要約文の連結点の語句を変更または追加
または削除することにより語句の調整を行い、要約文書
を出力しているので、文としての形態を保った要約文書
を出力することできる。According to the invention of claim 2 or 5, phrases are adjusted by changing, adding, or deleting the phrase at a connection point of the summary sentence using phrase change rules, and the summary document is then output, so a summary document that keeps its form as sentences can be output.

【００５３】請求項３または６の発明では、入力文書と
要約文の両方を利用して連結点の語句の調整を行なうの
で、入力文書の分割部分に含まれる文の結束性、時制、
法の情報の少なくともいずれか一つの情報を要約文書に
含めることできる。According to the invention of claim 3 or 6, the phrases at the connection points are adjusted using both the input document and the summary sentence, so at least one of the cohesion, tense, and mood information of the sentences contained in the divided portions of the input document can be included in the summary document.

[Brief description of the drawings]

【図１】本発明を適用した自然言語処理装置による基本
的な処理手順を示す流れ図である。FIG. 1 is a flowchart showing a basic processing procedure by a natural language processing apparatus to which the present invention is applied.

【図２】自動短文分割部１２による詳細な処理手順を示
す流れ図である。FIG. 2 is a flowchart showing the detailed processing procedure of the automatic short sentence division unit 12.

【図３】自動要約部１４による詳細な処理手順を示す流
れ図である。FIG. 3 is a flowchart showing the detailed processing procedure of the automatic summarization unit 14.

【図４】連結点（分割点）調整部１６による詳細な処理
手順を示す流れ図である。FIG. 4 is a flowchart showing the detailed processing procedure of the connection point (division point) adjustment unit 16.

[Explanation of symbols]

１０ 文書入力部: Document input unit
１２ 自動短文分割部: Automatic short sentence division unit
１４ 自動要約部: Automatic summarization unit
１６ 連結点（分割点）調整部: Connection point (division point) adjustment unit
１８ 要約文出力部: Summary sentence output unit
Ｆｄｖ 分割パターンファイル: Division pattern file

フロントページの続き (56)参考文献特開平７−192000（ＪＰ，Ａ) 特開昭61−117658（ＪＰ，Ａ) 特開平２−44462（ＪＰ，Ａ) 特開平３−138756（ＪＰ，Ａ) 特開平２−81175（ＪＰ，Ａ) 金淵培，江原暉将，「日英機械翻訳のための日本語単文分割と主語の補完」, 情報処理学会論文誌ｖｏｌ．35，ｎｏ. ６，ｐｐ．1018−1028（平成６年６月15 日発行) 武石英二，林良彦，「接続構造解析に基づく日本語複文の分割」，情報処理学会論文誌ｖｏｌ．33，ｎｏ．５，ｐｐ. 652−662（平成４年５月15日発行) 阿部ひろみ，奥西稔幸ほか，「接続助詞に注目した文分割の一方式」，情報処理学会第42回（平成３年前期）全国大会講演論文集，ｐｐ．３−13〜３−14（平成３年２月25日発行) 長尾真ほか編集，岩波情報科学辞典, ｐｐ．54−55（平成２年５月25日第１刷発行) (58)調査した分野(Int.Cl.⁷，ＤＢ名) G06F 17/27 G06F 17/30

Continuation of the front page

(56) References
JP-A-7-192000 (JP, A)
JP-A-61-117658 (JP, A)
JP-A-2-44462 (JP, A)
JP-A-3-138756 (JP, A)
JP-A-2-81175 (JP, A)
Yeun-Bae Kim and Terumasa Ehara, "Japanese simple-sentence division and subject completion for Japanese-to-English machine translation," Transactions of the Information Processing Society of Japan, vol. 35, no. 6, pp. 1018-1028 (published June 15, 1994)
Eiji Takeishi and Yoshihiko Hayashi, "Division of Japanese complex sentences based on connective structure analysis," Transactions of the Information Processing Society of Japan, vol. 33, no. 5, pp. 652-662 (published May 15, 1992)
Hiromi Abe, Toshiyuki Okunishi, et al., "A method of sentence division focusing on conjunctive particles," Proceedings of the 42nd National Convention of the Information Processing Society of Japan (first half of 1991), pp. 3-13 to 3-14 (published February 25, 1991)
Makoto Nagao et al. (eds.), Iwanami Dictionary of Information Science, pp. 54-55 (first printing published May 25, 1990)

(58) Fields searched (Int. Cl.⁷, DB name)
G06F 17/27
G06F 17/30

Claims

(57) [Claims]

1. A natural language processing apparatus that automatically generates, from an input document, a summary document corresponding to its content, comprising: dividing means for determining division points by obtaining an information element sequence from the input document and collating it with predetermined division patterns, and for dividing the input document into a plurality of short sentences; and summary generating means for generating, from the divided plurality of short sentences, a summary sentence that adopts a predetermined number of short sentences based on predetermined parameters, and for outputting the summary document in a completed sentence form by adjusting phrases of the short sentences before and after a division point according to whether another short sentence contiguous with an adopted short sentence has been adopted.

2. The natural language processing apparatus according to claim 1, wherein the summary generating means adjusts the phrases by changing, adding, or deleting a phrase at a connection point of the summary sentence using a phrase change rule, and outputs the summary document.

3. The natural language processing apparatus according to claim 2, wherein the phrase at the connection point is adjusted using both the input document and the summary sentence, and at least one of the cohesion, tense, and mood information of the sentence included in the divided portion of the input document is included in the summary document.

4. A natural language processing method for automatically generating, from an input document, a summary document corresponding to its content, comprising: a dividing step of determining division points by obtaining an information element sequence from the input document and collating it with predetermined division patterns, and dividing the input document into a plurality of short sentences; and a summary generating step of generating, from the divided plurality of short sentences, a summary sentence that adopts a predetermined number of short sentences based on predetermined parameters, and outputting the summary document in a completed sentence form by adjusting phrases of the short sentences before and after a division point according to whether another short sentence contiguous with an adopted short sentence has been adopted.

5. The natural language processing method according to claim 4, wherein in the summary generating step the phrases are adjusted by changing, adding, or deleting a phrase at a connection point of the summary sentence using a phrase change rule, and the summary document is output.

6. The natural language processing method according to claim 5, wherein the phrase at the connection point is adjusted using both the input document and the summary sentence, and at least one of the cohesion, tense, and mood information of the sentence included in the divided portion of the input document is included in the summary document.