JP2715875B2

JP2715875B2 - Multilingual summary generator

Info

Publication number: JP2715875B2
Application number: JP5330277A
Authority: JP
Inventors: 真一安藤; 伸一土井
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1993-12-27
Filing date: 1993-12-27
Publication date: 1998-02-18
Anticipated expiration: 2013-02-18
Also published as: JPH07192011A

Description

【発明の詳細な説明】【０００１】【産業上の利用分野】本発明は自然言語で記述された文
書を解析し、予め指定された内容を要約し、多言語翻訳
を行って出力する多言語要約生成装置に関するものであ
る。【０００２】【従来の技術】従来、要約文を生成する手法には、重要
語や特定の語彙、あるいは接続詞や文末表現を手がかり
として文書中の各文の重要度を評価し、重要な文から要
約文を組み上げる手法があった。しかし、これらの手法
は一般的な文書内容を対象として要約文を生成すること
を目的としているため、多様な言語現象に対応できず、
利用者の意図に応じた出力は得られなかった。また、予
め利用者が知りたい情報を詳細に設定することにより対
象領域を制限し、ある決まった構造で情報を抽出する情
報抽出手法には、例えば「情報処理学会第47回全国大
会講演論文集第３巻８３ページ」に記載のものが知
られている。ここに記載された装置では、対象領域に関
するキーワードと構文構造を利用したキーワード間関係
計算規則によって指定された情報を抽出し、フレーム形
式で抽出した情報を出力している。しかし、フレームの
ような形式は、利用者が各スロットの定義を熟知してい
る必要があり、また、特に指定情報の構造が複雑である
場合、理解しづらいという問題点がある。更に、数多く
の機械翻訳装置が提案されているが、一般の話題を対象
として新聞記事などの長い文を含む文書を高品質で翻訳
できる装置はなかった。このため、例えば新聞記事のク
リッピングサービスなどではキーワードで検索した記事
をそのままの形で、あるいは人手で要約、翻訳を行って
提供していた。【０００３】【発明が解決しようとする課題】従来の技術で述べた要
約手法は、文書に含まれる一般的な内容を抽出すること
を目的としているため、深い意味解析が必要となる。し
かし、実際の文書中には多種多様の言語現象が現れるた
め、これら全てに対応し、利用者が求める内容の正確な
要約文を生成することは困難であった。また従来提案さ
れている要約手法では、各文の重要度の定義は予め装置
が独自に備えており、利用者が何を知りたいかによって
要約する内容を変更することはできない。一般的な内容
を対象とせず、予め設定された領域のみを対象とする従
来の情報抽出手法は対象領域が予め指定されているため
正確な情報を抽出することができる。しかしその出力は
フレーム形式で与えられるため、出力を直接読み、利用
する利用者には理解しづらいという問題点がある。更に
機械翻訳では、原言語を解析する段階で構文レベルにお
いても語彙レベルにおいても曖昧性が生じて文が長くな
ると解析できない、文を越えた文脈レベルを扱えないと
いった問題点があり、翻訳結果の品質は悪かった。この
ため、例えば新聞記事クリッピングなどの場合、キーワ
ード検索のみによって得られた記事情報を全文書のま
ま、あるいは人手で要約、翻訳などの作業を行った後に
利用していた。【０００４】本発明の目的は入力文書に対し、利用者の
指定に応じた、高品質の要約を多言語で生成することで
ある。ここでは予め要約対象を指定することにより正確
な要約内容の抽出が可能であるため、利用者の指定に応
じた正確な要約を出力することができる。また、利用者
は利用に応じて要約対象を変更することができるため、
要約する内容を変更することができる。更に、要約内容
の抽出で得られる情報は曖昧性のない一定の形式である
ため、高品質の翻訳結果を出力することができる。【０００５】【課題を解決するための手段】上述した問題を解決する
ため、発明した多言語要約生成装置は自然言語で記述さ
れた文書を入力として受けつける文書入力部と、予め抽
出すべきと指定された要約対象に関係するキーワードと
指定された対象において各キーワードが持つ詳細情報を
格納するキーワード辞書部と、前記キーワード辞書部に
格納されたキーワード情報を利用してキーワード同士の
関係を認定し、指定された要約対象に一致するキーワー
ドの選択および合成を行う規則を格納したキーワード間
関係計算規則格納部と、前記キーワード辞書部に格納さ
れたキーワード情報と前記キーワード間関係計算規則格
納部に格納された規則を利用して、指定された要約対象
をキーワードとキーワード同士の関係として抽出する文
書情報抽出部と、前記文書情報抽出部で出力されたキー
ワードとキーワード同士の関係構造から成る一定の形式
を文を表す中間構造に変換し、必要な場合には複数の中
間構造に分割する規則を格納する文構造生成規則格納部
と、前記文構造生成規則格納部に格納された規則を利用
して前記文書情報抽出部が出力した指定情報の構造を文
を表す中間構造に変換する文構造生成部と、各キーワー
ドと目標言語の対応を示す目標言語辞書部と、文を表す
中間構造から自然言語文を生成する規則を格納した文生
成規則格納部と、前記目標言語辞書部に格納された語彙
情報と前記文生成規則格納部に格納された規則を利用し
て前記文構造生成部が出力した中間構造から自然言語文
を生成する文生成部と、前記文生成部が出力した自然言
語文を出力、表示する要約文出力部を備えている。【０００６】【実施例】次に本発明について図面を参照して説明す
る。【０００７】第1 図は本発明の請求項1 記載の一実施例
を示すブロック図である。第1 図を参照すると本発明
は、自然言語で記述された文書を入力として受けつける
文書入力部1 と、予め抽出すべきと指定された要約対象
に関係するキーワードと指定された対象において各キー
ワードが持つ詳細情報を格納するキーワード辞書部２
と、前記キーワード辞書部２に格納されたキーワード情
報を利用してキーワード同士の関係を認定し、指定され
た要約対象に一致するキーワードの選択および合成を行
う規則を格納したキーワード間関係計算規則格納部３
と、前記キーワード辞書部２に格納されたキーワード情
報と前記キーワード間関係計算規則格納部３に格納され
た規則を利用して、指定された要約対象をキーワードと
キーワード同士の関係として抽出する文書情報抽出部４
と、前記文書情報抽出部４で出力されたキーワードとキ
ーワード同士の関係構造から成る一定の形式を文を表す
中間構造に変換し、必要な場合には複数の中間構造に分
割する規則を格納する文構造生成規則格納部５と、前記
文構造生成規則格納部５に格納された規則を利用して前
記文書情報抽出部４が出力した指定情報の構造を文を表
す中間構造に変換する文構造生成部６と、各キーワード
と目標言語の対応を示す目標言語辞書部７と、文を表す
中間構造から自然言語文を生成する規則を格納した文生
成規則格納部８と、前記目標言語辞書部７に格納された
語彙情報と前記文生成規則格納部８に格納された規則を
利用して前記文構造生成部６が出力した中間構造から自
然言語文を生成する文生成部９と、前記文生成部９が出
力した自然言語文を出力、表示する要約文出力部１０か
ら構成される。【０００８】次に第１図を参照して、本発明の実施例の
動作について説明する。【０００９】本発明の一実施例として、「どのような企
業がどのような半導体製造技術を開発、製造、販売して
いるか、あるいは利用しているか」という半導体製造技
術の内容を利用者が知りたがっている場合を考える。ま
た文書情報抽出部４に対し、例えば第２図に示すよう
な、キーワードとキーワード間関係から成る出力形式が
与えられたとする。第２図は文書内に半導体製造技術に
関する内容が存在し、その要約内容は日本電気という開
発者とエッチング技術から成ることを示している。ここ
で、日本電気は東京都に存在することを示している。ま
たエッチング技術はその分類がプラズマエッチング技術
であり、６４メガビット用ＤＲＡＭに対応する技術であ
ることを示している。【００１０】ここで例えば、第３図に示す入力文書を考
える。【００１１】文書入力部１から入力された文書は文書情
報抽出部４に渡される。キーワード辞書部２には利用者
が指定した対象領域に関するキーワードが格納されてお
り、例えば半導体製造技術については、「日本電気」な
ど企業を表すキーワード、「プラズマエッチング」など
半導体製造技術を表すキーワード、「開発」など企業と
半導体製造技術の関係を表すキーワードと、それぞれに
ついての詳細情報が格納されている。またキーワード間
関係計算規則格納部３には、キーワード辞書部２に格納
されたキーワード間の関係、あるいは共起関係や構文構
造などを利用して、キーワードを組み合わせる方法やキ
ーワードに付された詳細情報を合成する手法を記述した
キーワード間関係計算規則が格納されている。文書情報
抽出部4はキーワード辞書部２に格納されたキーワード
情報とキーワード間関係計算規則格納部３に格納された
キーワード間関係計算規則を利用して入力された文書を
解析し、要約対象の内容を抽出し、指定された形式で出
力する。例えば第３図に示す文書の場合、まず、キーワ
ード辞書部２に格納されたキーワード、「日本電気」
「プラズマエッチング」「開発」などの語彙をキーワー
ドとして認識する。さらに、キーワード間関係計算規則
格納部３に納められた規則を適応する。例えば構文構造
を利用してキーワード間関係を認定し、関係あるキーワ
ードの詳細情報を合成することによって、文書情報抽出
部４は第２図に示す構造を出力する。文構造生成部６は
文書情報抽出部４の出力したデータを受け取り、文構造
生成規則格納部5 に納められた規則を適応することによ
って、受け取ったデータを文の構造を示す中間構造に変
換する。例えば第２図のデータを受け取った文構造生成
部６は、文書情報抽出部４の出力する構造についての知
識を基に記述された文構造生成規則を適応して、第４図
に示す変換結果を出力する。第４図では「どのような企
業がどのような半導体製造技術を開発、製造、販売して
いるか、あるいは利用しているか」という抽出対象の形
式に従い、用言「開発」を中心として「日本電気」を主
格、「プラズマエッチング技術」を目的格とする木構造
に変換している。また、文書情報抽出部４の出力した構
造が複雑な場合には、予め設定した文書情報抽出部４の
出力に応じた文分割規則を記述し、文構造生成規則格納
部５に納めることによって、複数の木構造に分けること
も可能である。文生成部９は文構造生成部６の出力を受
け取り、目標言語辞書部７に格納された対訳と文生成規
則格納部８に納められた文法規則を利用して文を生成す
る。例えば、目標言語を英語として目標言語辞書部７と
文生成規則格納部８を構成した場合、第４図の木構造の
入力に対して第５図の出力が得られる。要約文出力部１
０は文生成部９が出力した文を表示する。【００１２】【発明の効果】本発明では要約の視点となる情報を対象
領域と出力形式である中間構造を設定することによっ
て、利用者の意図に正確に応じた要約文を生成すること
ができる。また対象領域は変更可能であるため、複数の
要約の視点を利用者に提供することができる。さらに全
文を解析の対象とせず、キーワードを中心とした部分的
な解析で要約内容を抽出でき、曖昧性のない一定の中間
構造形式を得ることができる。すなわち、翻訳処理にお
いてはこの中間構造から直接翻訳を行うため、高品質の
多言語要約文を出力することができる。

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention
The present invention relates to a multilingual summary generator that analyzes a document written in natural language, summarizes content specified in advance, translates the summary into multiple languages, and outputs the result.

[0002] 2. Description of the Related Art
Conventional methods of generating a summary evaluate the importance of each sentence in a document, using important words, specific vocabulary, conjunctions, or sentence-final expressions as clues, and assemble the summary from the important sentences. However, because these methods aim to summarize general document content, they cannot cope with the variety of linguistic phenomena and do not produce output that reflects the user's intention.

Information extraction methods are also known. These restrict the target domain by having the information the user wants specified in detail beforehand, and extract information in a fixed structure. One is described, for example, in "Proceedings of the 47th National Convention of the Information Processing Society of Japan, Vol. 3, p. 83". The device described there extracts the specified information using keywords for the target domain and inter-keyword relation calculation rules that rely on syntactic structure, and outputs the extracted information in a frame format. A frame-like format, however, requires the user to be thoroughly familiar with the definition of each slot, and it is hard to understand, particularly when the structure of the specified information is complex.

Furthermore, although many machine translation devices have been proposed, none could translate, with high quality, documents on general topics that contain long sentences, such as newspaper articles. For this reason, newspaper article clipping services, for example, have provided articles retrieved by keyword either as they are or after manual summarization and translation.

[0003] Problems to Be Solved by the Invention
The summarization methods described above aim to extract the general content of a document and therefore require deep semantic analysis. However, because a wide variety of linguistic phenomena appear in real documents, it has been difficult to handle all of them and generate an accurate summary of the content the user wants. In addition, in the summarization methods proposed so far, the definition of sentence importance is built into the device in advance, so the content to be summarized cannot be changed according to what the user wants to know.
A conventional information extraction method, which targets only a preset domain rather than general content, can extract accurate information because the target domain is specified in advance. However, because its output is given in a frame format, it is hard to understand for a user who reads and uses the output directly. Furthermore, in machine translation, ambiguity arises at both the syntactic and the lexical level when the source language is analyzed, so long sentences cannot be analyzed, and context beyond the sentence cannot be handled; the quality of the translation results was therefore poor. For this reason, in newspaper article clipping, for example, articles obtained by keyword search alone have been used either in full or after manual work such as summarization and translation.

[0004] An object of the present invention is to generate, for an input document, a high-quality summary in multiple languages according to the user's specification. Because the summary target is specified in advance, the summary content can be extracted accurately, and an accurate summary that matches the user's specification can be output. Because the user can change the summary target to suit the use, the content that is summarized can also be changed. Furthermore, because the information obtained by extracting the summary content is in a fixed, unambiguous format, high-quality translation results can be output.

[0005] Means for Solving the Problems
To solve the problems described above, the multilingual summary generator of the invention comprises:
a document input unit that receives a document written in natural language as input;
a keyword dictionary unit that stores keywords related to a summary target designated in advance for extraction, together with the detailed information each keyword has for the designated target;
an inter-keyword relation calculation rule storage unit that stores rules for recognizing relations between keywords using the keyword information stored in the keyword dictionary unit, and for selecting and combining the keywords that match the designated summary target;
a document information extraction unit that uses the keyword information stored in the keyword dictionary unit and the rules stored in the inter-keyword relation calculation rule storage unit to extract the designated summary target as keywords and relations between keywords;
a sentence structure generation rule storage unit that stores rules for converting the fixed format output by the document information extraction unit, which consists of keywords and the relational structure between them, into an intermediate structure representing a sentence and, where necessary, for dividing it into a plurality of intermediate structures;
a sentence structure generation unit that uses the rules stored in the sentence structure generation rule storage unit to convert the structure of the designated information output by the document information extraction unit into an intermediate structure representing a sentence;
a target language dictionary unit that gives the correspondence between each keyword and the target language;
a sentence generation rule storage unit that stores rules for generating a natural language sentence from an intermediate structure representing a sentence;
a sentence generation unit that uses the vocabulary information stored in the target language dictionary unit and the rules stored in the sentence generation rule storage unit to generate a natural language sentence from the intermediate structure output by the sentence structure generation unit; and
a summary sentence output unit that outputs and displays the natural language sentence output by the sentence generation unit.

[0006] Embodiment
Next, the present invention will be described with reference to the drawings.

[0007] FIG. 1 is a block diagram showing an embodiment of the invention according to claim 1. Referring to FIG. 1, the invention consists of:
a document input unit 1 that receives a document written in natural language as input;
a keyword dictionary unit 2 that stores keywords related to a summary target designated in advance for extraction, together with the detailed information each keyword has for the designated target;
an inter-keyword relation calculation rule storage unit 3 that stores rules for recognizing relations between keywords using the keyword information stored in the keyword dictionary unit 2, and for selecting and combining the keywords that match the designated summary target;
a document information extraction unit 4 that uses the keyword information stored in the keyword dictionary unit 2 and the rules stored in the inter-keyword relation calculation rule storage unit 3 to extract the designated summary target as keywords and relations between keywords;
a sentence structure generation rule storage unit 5 that stores rules for converting the fixed format output by the document information extraction unit 4, which consists of keywords and the relational structure between them, into an intermediate structure representing a sentence and, where necessary, for dividing it into a plurality of intermediate structures;
a sentence structure generation unit 6 that uses the rules stored in the sentence structure generation rule storage unit 5 to convert the structure of the designated information output by the document information extraction unit 4 into an intermediate structure representing a sentence;
a target language dictionary unit 7 that gives the correspondence between each keyword and the target language;
a sentence generation rule storage unit 8 that stores rules for generating a natural language sentence from an intermediate structure representing a sentence;
a sentence generation unit 9 that uses the vocabulary information stored in the target language dictionary unit 7 and the rules stored in the sentence generation rule storage unit 8 to generate a natural language sentence from the intermediate structure output by the sentence structure generation unit 6; and
a summary sentence output unit 10 that outputs and displays the natural language sentence output by the sentence generation unit 9.

[0008] Next, the operation of the embodiment will be described with reference to FIG. 1.

[0009] As an embodiment, consider the case where the user wants to know about semiconductor manufacturing technology, specifically "what kind of company develops, manufactures, or sells, or else uses, what kind of semiconductor manufacturing technology". Assume also that the document information extraction unit 4 has been given an output format consisting of keywords and inter-keyword relations, such as that shown in FIG. 2. FIG. 2 indicates that the document contains content about semiconductor manufacturing technology, and that the summary content consists of a developer, NEC, and an etching technology. It further indicates that NEC is located in Tokyo, and that the etching technology is classified as plasma etching technology and is a technology for 64-megabit DRAM.

[0010] Now consider, for example, the input document shown in FIG. 3.

[0011] The document input from the document input unit 1 is passed to the document information extraction unit 4. The keyword dictionary unit 2 stores keywords for the target domain specified by the user. For semiconductor manufacturing technology, for example, it stores keywords denoting companies, such as "NEC"; keywords denoting semiconductor manufacturing technologies, such as "plasma etching"; and keywords denoting the relation between a company and a semiconductor manufacturing technology, such as "development", together with detailed information on each. The inter-keyword relation calculation rule storage unit 3 stores inter-keyword relation calculation rules. These rules describe how to combine keywords and how to merge the detailed information attached to them, using the relations between the keywords stored in the keyword dictionary unit 2 as well as co-occurrence relations, syntactic structure, and the like.

The document information extraction unit 4 analyzes the input document using the keyword information stored in the keyword dictionary unit 2 and the rules stored in the inter-keyword relation calculation rule storage unit 3, extracts the content of the summary target, and outputs it in the specified format. For the document shown in FIG. 3, for example, it first recognizes as keywords the words stored in the keyword dictionary unit 2, such as "NEC", "plasma etching", and "development". It then applies the rules stored in the inter-keyword relation calculation rule storage unit 3. For example, by recognizing inter-keyword relations from syntactic structure and merging the detailed information of related keywords, the document information extraction unit 4 outputs the structure shown in FIG. 2.

The sentence structure generation unit 6 receives the data output by the document information extraction unit 4 and, by applying the rules stored in the sentence structure generation rule storage unit 5, converts the data into an intermediate structure representing the structure of a sentence. For example, on receiving the data of FIG. 2, the sentence structure generation unit 6 applies sentence structure generation rules written on the basis of knowledge about the structure that the document information extraction unit 4 outputs, and outputs the conversion result shown in FIG. 4. In FIG. 4, following the form of the extraction target ("what kind of company develops, manufactures, or sells, or else uses, what kind of semiconductor manufacturing technology"), the data has been converted into a tree structure centered on the predicate "development", with "NEC" in the nominative case and "plasma etching technology" in the objective case. When the structure output by the document information extraction unit 4 is complex, it can also be divided into several tree structures by writing sentence division rules that correspond to the preset output of the document information extraction unit 4 and storing them in the sentence structure generation rule storage unit 5.

The sentence generation unit 9 receives the output of the sentence structure generation unit 6 and generates a sentence using the translation equivalents stored in the target language dictionary unit 7 and the grammar rules stored in the sentence generation rule storage unit 8.
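
The extraction and sentence structure steps described above can be illustrated with a minimal Python sketch. It is not the patented implementation: FIG. 2 to FIG. 4 are not reproduced in this text, so the sample sentence, the dictionary contents, and the names `KEYWORDS`, `Frame`, `Node`, `extract_frame`, and `frame_to_tree` are all assumptions made for illustration. The relation rule here is plain co-occurrence within one sentence, whereas the description also mentions syntactic structure.

```python
from dataclasses import dataclass, field

# Hypothetical contents of the keyword dictionary unit (2):
# keyword -> category plus the detailed information attached to it.
KEYWORDS = {
    "日本電気": {"category": "company", "details": {"location": "東京都"}},
    "プラズマエッチング": {"category": "technology",
                  "details": {"class": "プラズマエッチング技術"}},
    "開発": {"category": "relation", "details": {}},
}


@dataclass
class Frame:
    """Fixed-format extraction result (a stand-in for the FIG. 2 structure)."""
    relation: str
    company: str
    technology: str
    details: dict


@dataclass
class Node:
    """Node of the intermediate sentence structure (a stand-in for FIG. 4)."""
    head: str
    role: str
    children: list = field(default_factory=list)


def recognize_keywords(text):
    """Return the dictionary keywords that occur in text, keyed by category."""
    found = {}
    for keyword, entry in KEYWORDS.items():
        if keyword in text:
            found.setdefault(entry["category"], keyword)
    return found


def extract_frame(sentence):
    """Toy relation rule: a company, a relation keyword, and a technology
    that co-occur in one sentence are combined into a single frame."""
    found = recognize_keywords(sentence)
    if set(found) != {"company", "relation", "technology"}:
        return None
    # Merge the detailed information attached to the related keywords.
    details = {found[c]: KEYWORDS[found[c]]["details"]
               for c in ("company", "technology")}
    return Frame(found["relation"], found["company"], found["technology"], details)


def frame_to_tree(frame):
    """Toy sentence structure rule: the relation keyword becomes the predicate,
    the company the nominative argument, the technology class the objective one."""
    tech = frame.details[frame.technology].get("class", frame.technology)
    return Node(frame.relation, "predicate", [
        Node(frame.company, "nominative"),
        Node(tech, "objective"),
    ])


sentence = "日本電気はプラズマエッチング技術を開発した。"  # invented sample input
frame = extract_frame(sentence)
tree = frame_to_tree(frame)
print(tree)
```

Because only sentences containing all three keyword categories yield a frame, the rest of the document is never parsed, which is the partial, keyword-centered analysis the description relies on.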
For example, when the target language dictionary unit 7 and the sentence generation rule storage unit 8 are configured with English as the target language, the output of FIG. 5 is obtained from the tree structure of FIG. 4. The summary sentence output unit 10 displays the sentence output by the sentence generation unit 9.

[0012] Effects of the Invention
In the present invention, the information that serves as the viewpoint of the summary is set through the target domain and the intermediate structure that is the output format, so a summary that accurately reflects the user's intention can be generated. Because the target domain can be changed, several summary viewpoints can be offered to the user. Furthermore, the summary content can be extracted by partial analysis centered on keywords, without analyzing every sentence, and a fixed, unambiguous intermediate structure is obtained. Because translation is performed directly from this intermediate structure, high-quality multilingual summaries can be output.
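
The claim that translation proceeds directly from the intermediate structure can be illustrated with a short Python sketch of the generation step. It is a simplified illustration under stated assumptions: the actual FIG. 4 tree and FIG. 5 output are not reproduced in this text, and `Clause`, `LEXICONS`, `TEMPLATES`, and `generate` are hypothetical names. One template per language stands in for the grammar rules of the sentence generation rule storage unit, and one lexicon per language stands in for the target language dictionary unit.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """Intermediate structure: a predicate with its two case-marked arguments."""
    predicate: str
    nominative: str
    objective: str


# Hypothetical target language dictionaries: keyword -> target-language word.
LEXICONS = {
    "en": {"開発": "developed", "日本電気": "NEC",
           "プラズマエッチング技術": "plasma etching technology"},
    "ja": {"開発": "開発した", "日本電気": "日本電気",
           "プラズマエッチング技術": "プラズマエッチング技術"},
}

# Hypothetical generation rules: word order and function words per language.
TEMPLATES = {
    "en": "{nominative} {predicate} {objective}.",
    "ja": "{nominative}は{objective}を{predicate}。",
}


def generate(clause, lang):
    """Look up each keyword in the lexicon and fill the language's template."""
    lexicon = LEXICONS[lang]
    words = {role: lexicon[getattr(clause, role)]
             for role in ("predicate", "nominative", "objective")}
    return TEMPLATES[lang].format(**words)


clause = Clause("開発", "日本電気", "プラズマエッチング技術")
for lang in ("en", "ja"):
    print(lang, generate(clause, lang))
```

Adding a language in this sketch means adding one lexicon and one template; the extraction side is untouched, which mirrors the separation between the extraction units and the generation units in the description.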

【図面の簡単な説明】【図１】図は本発明の請求項1 記載の一実施例であるブ
ロック図を説明する図である。【図２】図は本発明の一実施例を説明する図である。【図３】図は本発明の一実施例を説明する図である。【図４】図は本発明の一実施例を説明する図である。【図５】図は本発明の一実施例を説明する図である。【符合の説明】１文書入力部２キーワード辞書部３キーワード間関係計算規則格納部４文書情報抽出部５文書構造生成規則格納部６文構造生成部７目標言語辞書部８文生成規則格納部９文生成部１０要約文出力部

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of an embodiment of the invention according to claim 1.
FIG. 2 is a diagram illustrating an embodiment of the present invention.
FIG. 3 is a diagram illustrating an embodiment of the present invention.
FIG. 4 is a diagram illustrating an embodiment of the present invention.
FIG. 5 is a diagram illustrating an embodiment of the present invention.

[Description of Reference Numerals]
1 Document input unit
2 Keyword dictionary unit
3 Inter-keyword relation calculation rule storage unit
4 Document information extraction unit
5 Sentence structure generation rule storage unit
6 Sentence structure generation unit
7 Target language dictionary unit
8 Sentence generation rule storage unit
9 Sentence generation unit
10 Summary sentence output unit

Claims

(57) [Claims]
[Claim 1] A multilingual summary generator comprising:
a document input unit that receives a document written in natural language as input;
a keyword dictionary unit that stores keywords related to a summary target designated in advance for extraction, together with the detailed information each keyword has for the designated target;
an inter-keyword relation calculation rule storage unit that stores rules for recognizing relations between keywords using the keyword information stored in the keyword dictionary unit, and for selecting and combining the keywords that match the designated summary target;
a document information extraction unit that uses the keyword information stored in the keyword dictionary unit and the rules stored in the inter-keyword relation calculation rule storage unit to extract the designated summary target as keywords and relations between keywords;
a sentence structure generation rule storage unit that stores rules for converting the fixed format output by the document information extraction unit, which consists of keywords and the relational structure between them, into an intermediate structure representing a sentence and, where necessary, for dividing it into a plurality of intermediate structures;
a sentence structure generation unit that uses the rules stored in the sentence structure generation rule storage unit to convert the structure of the designated information output by the document information extraction unit into an intermediate structure representing a sentence;
a target language dictionary unit that gives the correspondence between each keyword and the target language;
a sentence generation rule storage unit that stores rules for generating a natural language sentence from an intermediate structure representing a sentence;
a sentence generation unit that uses the vocabulary information stored in the target language dictionary unit and the rules stored in the sentence generation rule storage unit to generate a natural language sentence from the intermediate structure output by the sentence structure generation unit; and
a summary sentence output unit that outputs and displays the natural language sentence output by the sentence generation unit.