JPH09114850A - Document processor

Info

Publication number: JPH09114850A
Application number: JP7268406A
Authority: JP
Inventors: Toshiyuki Sugio; 俊之杉尾
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1995-10-17
Filing date: 1995-10-17
Publication date: 1997-05-02

Abstract

PROBLEM TO BE SOLVED: To divide an input document automatically according to changes of context or topic by providing a word life calculator that, based on the positions at which an arbitrary word appears in the input document, obtains life information for the word, that is, the span over which the word exerts its asserting function in the document, and by dividing the input document with reference to this word life information. SOLUTION: When an arbitrary input document 1 is input to the life calculator 2, the input document 1 is divided into sentences, which are the units of calculation. The sentences are then sent to a calculation target recognition means 21, which recognizes the calculation targets. Next, after the potential of each word is obtained by a potential calculation means 22, the life of the word is estimated by a life estimation means 23. That is, the potential distribution, the life, and so on of each word are calculated from the word's potentials in a life table 3, and the results are stored in slots of the life table 3. The life calculator 2 then starts a document division filter 4 and a characteristic word extraction filter 5.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a document processing apparatus that processes documents written in any natural language, and in particular to one that can function as a document classification apparatus or a characteristic word extraction apparatus.

[0002]

[Prior Art] In recent years, large-scale digitized documents written in natural language have come to be provided at low cost via electronic media such as computer networks and CD-ROMs. As a result, organizations and individuals have more opportunities to handle electronic documents, and how to use these documents effectively is becoming an important issue.

[0003] To handle a large volume of electronic documents, the category of each document must be identified and the document classified in advance, so that desired documents can later be retrieved. An example of an apparatus with this function is the document processing apparatus (automatic document assigning apparatus) disclosed in Japanese Patent Laid-Open No. 6-75995.

[0004] As shown in FIG. 2, this document processing apparatus is implemented on an information processing apparatus such as a workstation comprising a processing device 100, an input device 101, an output device 102, an internal memory 103, an external memory 104, and so on. Basic data for classification is created in advance from data on the classifications and keywords of a plurality of documents to which classifications have already been assigned (classified documents), and this basic data is used to assign to an unclassified document a classification that fits its content.

[0005] The basic data used for the classification assignment process consists of the classification distance table shown in FIG. 3, which represents the strength of the relationship between two classifications, and the keyword/classification table shown in FIG. 4, which stores, for each keyword, the classifications closely related to that keyword and the degree of that relatedness.

[0006] When a plurality of keywords contained in an unclassified document are input to this document processing apparatus, it refers to the keyword/classification table, calculates for each classification the total of the relatedness degrees associated with the input keywords, and determines the candidate classifications in descending order of this total. It then refers to the classification distance table to check whether the distances between the candidate classifications fall within an appropriate range, and determines the final classification (not necessarily a single one) of the unclassified document.
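The conventional procedure in [0006] can be sketched roughly as follows. This is only an illustration under stated assumptions: the table contents, the function name `classify`, the parameters `top_n` and `max_distance`, and the reading of "distance" as "smaller means more closely related" are all hypothetical and are not taken from the cited publication.

```python
# Hypothetical keyword/classification table: keyword -> {classification: relatedness degree}
KEYWORD_CLASS = {
    "election": {"politics": 0.9, "economy": 0.2},
    "budget": {"politics": 0.5, "economy": 0.8},
    "recipe": {"life": 0.9},
}

# Hypothetical classification distance table (symmetric).
# Assumption: a smaller value means the two classifications are more closely related.
CLASS_DISTANCE = {
    frozenset({"politics", "economy"}): 1.0,
    frozenset({"politics", "life"}): 3.0,
    frozenset({"economy", "life"}): 2.5,
}


def classify(keywords, top_n=2, max_distance=2.0):
    """Return the accepted classifications and the per-classification totals."""
    # Sum the relatedness degrees per classification over all input keywords.
    totals = {}
    for kw in keywords:
        for cls, degree in KEYWORD_CLASS.get(kw, {}).items():
            totals[cls] = totals.get(cls, 0.0) + degree

    # Candidates in descending order of total.
    candidates = sorted(totals, key=totals.get, reverse=True)[:top_n]

    # Keep a candidate only if it lies within max_distance of every
    # classification accepted so far (an assumed form of the range check).
    accepted = []
    for cls in candidates:
        if all(CLASS_DISTANCE[frozenset({cls, a})] <= max_distance for a in accepted):
            accepted.append(cls)
    return accepted, totals
```

With these made-up tables, `classify(["election", "budget"])` accepts both "politics" and "economy", whereas `classify(["election", "recipe"])` keeps only "politics" because "life" is too distant from it.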

[0007]

[Problems to Be Solved by the Invention] However, the conventional document processing apparatus has the following problems.

[0008] (1) A "keyword/classification table" for matching keywords against classifications and a "classification distance table" indicating the strength of the relationships between classifications must be prepared in advance, and a reasonable classification accuracy cannot be obtained unless these tables hold data covering most of the classifications of the target documents. In particular, reasonable accuracy cannot be ensured unless the keyword/classification table contains a large number of keywords. Securing such accuracy therefore increases the amount of data and requires a large-capacity external memory.

[0009] (2) When the document to be classified spans multiple fields (for example, one day's worth of newspaper articles containing text on economics, politics, daily life, current affairs, and so on), it is difficult to set keywords that are consistent across the whole unclassified document. If keywords are set manually in such a situation, the settings may differ from person to person depending on knowledge and experience, which may adversely affect classification accuracy.

[0010] (3) Furthermore, even if a final classification can be determined for a compound document spanning multiple fields, the result is expected to be no more than an enumeration of many classifications. Consequently, even when classifications are assigned to the document, they are of little benefit when actually used.

[0011]

[Means for Solving the Problems] The document processing apparatus of the first invention comprises: (1) input means for inputting a document written in natural language; (2) a word life calculator that, based on the positions at which an arbitrary word appears in the input document, obtains life information for the word, that is, the span over which the word exerts its asserting function in the document; (3) word life information storage means for storing the word life information obtained by the word life calculator; and (4) document dividing means for dividing the input document by referring to the word life information stored in the word life information storage means.

[0012] In the document processing apparatus of the first invention, dividing the input document with reference to word life information makes it possible to divide the input document according to changes of context or topic.

[0013] The document processing apparatus of the second invention comprises: (1) input means for inputting a document written in natural language; (2) a word life calculator that, based on the positions at which an arbitrary word appears in the input document, obtains life information for the word, that is, the span over which the word exerts its asserting function in the document; (3) word life information storage means for storing the word life information obtained by the word life calculator; and (4) characteristic word extracting means for extracting characteristic words that serve as an index of the input document by referring to the word life information stored in the word life information storage means.

[0014] In the document processing apparatus of the second invention, extracting the characteristic words of the input document with reference to word life information makes it unnecessary for the user to specify the characteristic words of the input document.

[0015] The document processing apparatus of the third invention comprises: (1) input means for inputting a document written in natural language; (2) a word life calculator that, based on the positions at which an arbitrary word appears in the input document, obtains life information for the word, that is, the span over which the word exerts its asserting function in the document; (3) word life information storage means for storing the word life information obtained by the word life calculator; (4) document dividing means for dividing the input document by referring to the word life information stored in the word life information storage means; and (5) characteristic word extracting means for extracting characteristic words that serve as an index of the input document and/or the divided documents by referring to the word life information stored in the word life information storage means.

[0016] In the document processing apparatus of the third invention, dividing the input document with reference to word life information makes it possible to divide the input document according to changes of context or topic, and extracting the characteristic words of the input document with reference to word life information makes it unnecessary for the user to specify the characteristic words of the input document.

[0017]

[Embodiments of the Invention] Next, an embodiment of the document processing apparatus according to the present invention will be described with reference to the drawings.

[0018] Before the embodiment is described in detail, the basic approach to document processing (document classification) that the embodiment follows is explained, and the key terms are defined in the course of that explanation.

[0019] To classify an input document accurately, the classification features of the document must be defined accurately. One way to define classification features is to let a number of character strings in the text (hereinafter called words) represent the gist that the document expresses, and to use such words as the index for document classification. This embodiment adopts that method. Hereinafter, a word used as an index for document classification is called a characteristic word. Because the choice of characteristic words strongly affects the accuracy of the classification result, the method of selecting and extracting characteristic words from the input document is very important.

[0020] In this embodiment, characteristic words are extracted and classifications are assigned to documents according to the following ideas (1) to (7).

[0021] (1) Words with the same character string that appear multiple times in a given document (hereinafter called identical words) represent a concept that remains consistent within that document. Such words are therefore candidates for characteristic words.

[0022] (2) When an identical word appears multiple times in a document, the intent of that word (to assert something) can be regarded as being at work from the position where the word first appears to the position where it last appears.

[0023] (3) Beyond the range over which this intent is at work, the intent of the identical word naturally no longer operates, so the identical word can be considered to have a life.

[0024] (4) For each single occurrence of an identical word, the range influenced by that occurrence (hereinafter called the potential of the word) can be determined from the relationship between its position and the positions of the adjacent occurrences of the same word. Because the arrangement of identical words in a document (the way they appear) faithfully reflects the context that the document expresses, the potential of a word is also affected by context.

[0025] (5) Seen this way, it can generally be said that the greater the distance between an occurrence and the adjacent occurrence of the same word, the greater its potential. However, a compound document contains changes of context (hereinafter called gaps), and the same word may appear in different contexts, so a large distance from the adjacent identical word does not necessarily mean that the potential of the word is large.

[0026] (6) The life of a word is therefore defined as the sum of the potentials of the valid occurrences (those considered to belong to the same context) among the multiple occurrences of the identical word.

[0027] (7) Words with a long life are extracted as characteristic words that represent the features of the document, and classifications are assigned to the document based on those words. In addition, the gaps in a compound document are determined by observing the distribution of the lives of multiple kinds of words, which makes it possible to assign a classification to each divided part of the document.
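Ideas (4) to (7) can be illustrated with a minimal sketch. The formulas here are assumptions made only for illustration, since this part of the description does not give them: the potential of an occurrence is taken to be the distance to the next occurrence of the same word, and a potential is treated as a context gap when it exceeds the mean potential by more than `k` standard deviations. The function names and the parameter `k` are hypothetical.

```python
from statistics import mean, pstdev


def potentials(positions):
    """Assumed potential of each occurrence: distance to the next occurrence."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def word_life(positions, k=1.0):
    """Return (life, gap_indexes) for one word.

    life is the sum of the potentials regarded as valid, per idea (6).
    gap_indexes lists the potentials treated as context gaps (assumed rule:
    potential > mean + k * standard deviation).
    """
    pots = potentials(positions)
    if not pots:
        return 0, []  # a word that occurs once has no potential in this sketch
    threshold = mean(pots) + k * pstdev(pots)
    gaps = [i for i, p in enumerate(pots) if p > threshold]
    life = sum(p for p in pots if p <= threshold)
    return life, gaps


def characteristic_words(occurrences, top_n=3):
    """Rank words by life, longest first, per idea (7)."""
    lives = {word: word_life(pos)[0] for word, pos in occurrences.items()}
    return sorted(lives, key=lives.get, reverse=True)[:top_n]
```

For a word found at positions 2, 5, 9, 60 and 63, the jump from 9 to 60 stands out from the other distances, so this sketch treats it as a gap and sums only the remaining potentials. When no gap is detected, the life reduces to the distance between the first and last occurrence, which matches idea (2).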

[0028] The document processing apparatus of the embodiment, which follows the above ideas, is implemented, as in the conventional case, on an information processing apparatus such as a workstation comprising the processing device 100, input device 101, output device 102, internal memory 103, external memory 104, and so on shown in FIG. 2. Functionally, however, it has the configuration shown in FIG. 1.

[0029] The configuration of the information processing apparatus, as it relates to the embodiment, is briefly as follows. The input device 101 is a keyboard, mouse, CD-ROM reader, or the like used to input unclassified documents. The output device 102 mainly outputs the classification results and is a CRT, printer, or the like. The internal memory 103 stores the processing program of the processing device 100, which is the processing means, together with the data for that program, and the external memory 104 stores the input document data. The processing device 100 mainly executes the document classification operation, accessing the peripheral devices 101, 102, and 104 as needed.

[0030] In FIG. 1, the document processing apparatus according to this embodiment functionally comprises an input/output interface 7, an input document (storage unit) 1, a life calculator 2, a life table 3, a document division filter 4, a characteristic word extraction filter 5, a document segment (storage unit) 61, and a document index (storage unit) 62. The document segment (storage unit) 61 and the document index (storage unit) 62 together constitute the output result (storage unit) 6.

[0031] The input/output interface 7 takes in the input document 1, written in natural language, from an input device such as a keyboard, and outputs classification results, processing status, and the like to an output device such as a display.

[0032] As long as its text is in a so-called natural language, the input document 1 may be a compound document whose content ranges over complex fields. Nor is it restricted by the kind of language, such as English or Japanese.

[0033] The life calculator 2 calculates the life and related values of arbitrary words in the input document 1. In detail, it comprises calculation target recognition means 21, potential calculation means 22, and life estimation means 23. The calculation target recognition means 21 recognizes the positions of the words that are the calculation targets. The potential calculation means 22 calculates the potential of each target word. The life estimation means 23 estimates the life of each target word from the calculated potentials.

[0034] The life table (word life information storage means) 3 stores the estimated life and related values of words with arbitrary character strings in the input document 1. It comprises, for example, a plurality of units 200 (hereinafter called life table units) configured as shown in FIG. 5. In FIG. 5, one unit 200 of the life table 3 corresponds to one arbitrary word (a word with a given headword) of the input document 1.

[0035] Each life table unit 200 has headword slots 211 to 213 that store the headword of the word, physical position slots 221 to 223 that store the physical position of the word, logical position slots 231 to 233 that store a logical position normalized with respect to the size of the input document 1 (for example, the total number of words or the amount of text data), bibliographic information slots 241 to 243 that store bibliographic information such as the part of speech of the word, and potential slots 251 to 253 that store the potential value of the word. For each occurrence of the identical word, information is set in the set of slots 21i to 25i (i = 1 to 3) of different kinds in the same row. That is, the slot group 211 to 251 in the first row holds information on the first occurrence of a given identical word, the slot group 212 to 252 in the second row holds information on its second occurrence, and the slot group 213 to 253 in the last row holds information on its last occurrence.

[0036] Each life table unit 200 also has a headword slot 261 that stores the headword of the word, a frequency slot 262 that stores the frequency of occurrence of the identical word (equal to the number of rows described above), a potential distribution slot 263 that stores the mean (mean potential) and standard deviation of the potentials stored in the potential slots 251 to 253, a life slot 264 that stores the life value of the word, and a gap slot 265 that stores the positions of context gaps inferred from the potential distribution (263) and the individual potentials (251 to 253).

[0037] The same headword information is set in the headword slots 211 to 213 and 261 within the same unit 200. Alternatively, a life table unit may store the headword information in a single slot only.
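The layout of a life table unit described in [0034] to [0037] might be modeled as below. This is a sketch only: the class and field names are hypothetical, the headword is kept once per row and once per unit as in FIG. 5, and the logical position is assumed to be the physical position divided by the document size, which is one possible normalization.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Occurrence:
    """One row of a life table unit (slots 21i to 25i)."""
    headword: str                      # headword slot 21i
    physical_pos: int                  # physical position slot 22i
    logical_pos: float                 # logical position slot 23i
    biblio: str                        # bibliographic information slot 24i
    potential: Optional[float] = None  # potential slot 25i, filled in later


@dataclass
class LifeTableUnit:
    """One unit 200: everything recorded for a single headword."""
    headword: str                                    # headword slot 261
    rows: List[Occurrence] = field(default_factory=list)
    potential_mean: Optional[float] = None           # potential distribution slot 263
    potential_std: Optional[float] = None            # potential distribution slot 263
    life: Optional[float] = None                     # life slot 264
    gaps: List[float] = field(default_factory=list)  # gap slot 265

    @property
    def frequency(self) -> int:
        """Frequency slot 262: equal to the number of rows."""
        return len(self.rows)

    def add(self, physical_pos: int, total_size: int, biblio: str = "") -> None:
        """Append a row for a new occurrence (assumed normalization: pos / size)."""
        logical = physical_pos / total_size
        self.rows.append(Occurrence(self.headword, physical_pos, logical, biblio))
```

Deriving the frequency from the number of rows, rather than storing it separately, keeps slot 262 consistent with the rows by construction.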

[0038] The document division filter 4, which is the document dividing means, uses the information stored in the life table 3 to divide the input document 1 by context. This division may be performed hierarchically. Hereinafter, a part obtained by such division is called a document segment.

[0039] The characteristic word extraction filter 5, which is the characteristic word extracting means, uses the information stored in the life table 3 to extract one or more words (characteristic words) that represent the features of the input document 1.

[0040] The document segment storage unit 61 stores the document segments produced by the document division filter 4. The document index storage unit 62 stores index information for each document segment (hereinafter called a document index), based on the division information from the document division filter 4 and the characteristic word information from the characteristic word extraction filter 5.

[0041] The document segment storage unit 61 and the document index storage unit 62 constitute the output result storage unit 6, and the information stored in the output result storage unit 6 is output to the user via the input/output interface 7.

[0042] FIG. 6 is an explanatory diagram showing an example of the storage format of the document index storage unit 62, that is, the storage structure of the document index.

[0043] The document index storage unit 62 grows as a hierarchical structure, storing the index information of the divided document segments 61 as the input document 1 is divided step by step. Conversely, the hierarchy may be built by first obtaining finely divided document segments 61 and then merging them into larger document segments 61. Because the document segments 61 form a hierarchy, the document index 62 for each document segment 61 is called a node in the description of FIG. 6.

[0044] FIG. 6 shows how the document segment defined by node 300 is divided into document segments for a plurality of nodes 310, 320, ... at the next level, and how the document segments defined by the nodes 310, 320, ... at that level are in turn divided into document segments for a plurality of nodes 311, 312, ... at the level below. A document segment is defined for each node at each level, and a document index is stored for each document segment. The top-level node 300 represents the node corresponding to the input document 1 itself.

[0045] The storage structure of each node, and hence of the index information, is the same regardless of node level. It consists of a label part 301, an identifier slot 302, a start position slot 303, an end position slot 304, and a characteristic word slot 305.

[0046] The label part 301 stores a number indicating the level and position of the node (for example, 0, 1, 1.1, 1.2, ..., 2, ...). The identifier slot 302 is an area that stores an identifier for the corresponding document segment; the stored identifier makes it possible to access that segment. The start position slot 303 is an area that stores the start position of the node's document segment, expressed as a logical position relative to the document segment one level up. The end position slot 304 is an area that stores the end position of the node's document segment, likewise expressed as a logical position relative to the document segment one level up. For the top node 300, which has no parent node, the start position slot 303 and the end position slot 304 store the minimum and maximum logical positions over the whole input document 1, respectively. The characteristic word slot 305 stores the characteristic words that serve as the index of the document segment corresponding to the node.

[0047] Every node has the same format, as described above, and the structure keeps growing, with its levels becoming deeper or shallower each time a document segment is divided or merged.
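The node structure of [0043] to [0047] could be sketched as a small tree. Several details here are assumptions for illustration only: the class name `IndexNode`, the method `split`, the identifier format, and the convention that positions inside a segment are expressed on a 0.0 to 1.0 scale relative to the parent segment. Only top-down division is shown; merging is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndexNode:
    label: str        # label part 301, e.g. "0", "1", "1.2"
    segment_id: str   # identifier slot 302
    start: float      # start position slot 303 (relative to the parent segment)
    end: float        # end position slot 304 (relative to the parent segment)
    feature_words: List[str] = field(default_factory=list)    # characteristic word slot 305
    children: List["IndexNode"] = field(default_factory=list)

    def split(self, cut_points, id_prefix="seg"):
        """Divide this node's segment at the given relative positions.

        Children are labeled 1, 2, ... under the top node "0" and
        <parent>.1, <parent>.2, ... under any other node.
        """
        edges = [0.0, *sorted(cut_points), 1.0]
        for n, (s, e) in enumerate(zip(edges, edges[1:]), start=1):
            label = str(n) if self.label == "0" else f"{self.label}.{n}"
            self.children.append(IndexNode(label, f"{id_prefix}-{label}", s, e))
        return self.children


# Usage: divide the whole document once, then divide its first segment again.
root = IndexNode("0", "seg-0", 0.0, 1.0)
first, second = root.split([0.4])
first.split([0.5])
```

After these calls the tree holds the labels 0, 1, 1.1, 1.2 and 2, matching the numbering example given for the label part 301.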

[0048] Next, the operation of the document processing apparatus of the embodiment, which has the functional configuration described above, is explained with reference to flowcharts.

[0049] First, the overall operation of the life calculator 2 is described with reference to the flowchart of FIG. 7.

[0050] When an arbitrary input document 1 is input to the life calculator 2 (step 401), the input document 1 is divided into sentences, which are the units of calculation (step 402). The unit of calculation is not limited to a sentence; the subsequent life calculation is equally possible with any physically delimited character string (text) of arbitrary size.

[0051] After the input document 1 has been divided into sentences, the sentences are sent in order to the calculation target recognition means 21, which recognizes the calculation targets (step 403). The calculation target recognition means 21 mainly recognizes the words of each sentence it receives as calculation targets, calculates the position of each word in the input document 1, and stores the results in the corresponding slots (211 to 213, 221 to 223, 231 to 233, 241 to 243) of the life table 3. The operation of the calculation target recognition means 21 is described in detail later with reference to FIG. 8.

[0052] After the calculation targets have been recognized by the calculation target recognition means 21, the potential calculation means 22 is activated (step 404). The potential calculation means 22 calculates the potential of each occurrence of each identical word and stores the results in the potential slots (251 to 253) of the life table 3. The operation of the potential calculation means 22 is described in detail later with reference to FIG. 9.

[0053] After the word potentials have been obtained by the potential calculation means 22, the life of each word is estimated by the life estimation means 23 (step 405). The life estimation means 23 calculates the potential distribution, the life, and so on of each word from the potentials (251 to 253) of that word in the life table 3, and stores the results in the potential distribution slot (263) and the life slot (264) of the life table 3. The operation of the life estimation means 23 is described in detail later with reference to FIG. 10.

【００５４】Once information has been stored in each slot of the life table 3 in this way, the life calculator 2 activates the document division filter 4 and the characteristic word extraction filter 5 (step 406) and ends the series of processes.

【００５５】FIG. 8 is a flowchart showing the operation of the calculation target recognition means 21 in detail. The operation is described below with reference to this flowchart.

【００５６】First, when the group of sentences obtained by dividing the input document 1 is input, each sentence is divided into words (step 501). The division into words is performed using, for example, a known morphological analysis technique.

【００５７】Next, one of the divided words is set as the processing target. If the word has a new heading, a new life table unit (200) is generated for it; if a word with that heading is being processed for the second or a subsequent time, the existing life table unit (200) for that heading is identified. A heading slot (21i) is then generated in that unit and the heading of the word is stored in it, and a bibliographic information slot (24i) is generated to store bibliographic information indicating the word's part of speech and its role in the text (for example, "appeared in a bulleted list" or "is a word forming a title") (step 502).

【００５８】After that, the physical position and the logical position of the word being processed are calculated in turn (steps 503 and 504), a physical position slot (23i) and a logical position slot (24i) are generated, and the calculated physical and logical positions are stored in them (step 505).

【００５９】The physical and logical positions of a word are calculated in terms of the unit of calculation that was set when the input document 1 was first input to the life calculator 2 (described as the "sentence" in this embodiment).

【００６０】An example of calculating the physical position is given for the case where the 15th sentence of the input document 1 is 『／特捜部１／は２／、３／これまでに４／に５／株６／を７／譲り８／受け９／た１０／民間人１１／ら１２／の１３／ほとんど１４／から１５／事情聴取１６／を１７／行なっ１８／た１９／。２０／』 (roughly, "The special investigation department has so far questioned most of the private citizens and others to whom shares were transferred"). For convenience of explanation, word boundaries are marked with "／" and the order of appearance of each word is shown by the numbers 1 to 20.

【００６１】If the word being processed is 「株」 ("shares") in this sentence, its physical position is calculated as the pair [15, 6]: the position of the sentence in the input document 1 (15) and the order of appearance of the word within that sentence (6).

【００６２】This physical position is used in the subsequent calculation of the logical position. Once the physical position of the word being processed is determined, its logical position is calculated (step 504 above). As an example, if the logical position is defined as the word position with the maximum distance in the input document normalized to 100 (%), the logical position (m, n) of a word whose physical position is [m, n] (the word appears as the n-th word of the m-th sentence of the input document) can be calculated according to equation (1) below.
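
Equation (1) is not reproduced in this text. Judging from the worked instance given as equation (2), it presumably has the following form. The symbols W_m (number of words in the m-th sentence) and S (number of sentences in the input document) are names assumed here for illustration and do not come from the original.

```latex
\text{logical position}(m, n)
  = \left\{ (m - 1) + \frac{n}{W_m} \right\} \times \frac{100}{S}
  \qquad (1)
```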

【００６３】[0063]

【数１】
For example, suppose the input document 1 contains 250 sentences. The word 「株」 in the sentence above has the physical position [15, 6], so its logical position (15, 6) is calculated as 5.72% by equation (2), obtained by substituting concrete values into equation (1). In other words, the word 「株」 appears at a logical position 5.72% of the way from the beginning of the input document 1.

【００６４】
$$\text{logical position}(15, 6) = \left\{ (15 - 1) + \frac{6}{20} \right\} \times \frac{100}{250} = 5.72\ (\%) \qquad (2)$$
Here 20 is the number of words in the 15th sentence. The maximum distance in the input document 1 was normalized to 100 (%) for convenience in the subsequent calculations, but it goes without saying that the normalization factor need not be fixed at any particular value.
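
The worked example in equation (2) can be checked with a few lines of Python. This is a minimal sketch: the function name `logical_position` and its parameter names are assumptions made for illustration, and the formula is the one implied by equation (2).

```python
def logical_position(m, n, words_in_sentence, total_sentences, scale=100.0):
    """Logical position of the n-th word of the m-th sentence (both 1-based).

    The document length is normalized to `scale` (100 gives a percentage).
    """
    return ((m - 1) + n / words_in_sentence) * scale / total_sentences


# The word at physical position [15, 6] in a 20-word sentence of a 250-sentence document.
pos = logical_position(15, 6, words_in_sentence=20, total_sentences=250)
print(round(pos, 2))  # 5.72
```

Passing a different `scale` corresponds to the remark that the normalization factor is not fixed.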

【００６５】The physical and logical positions of the word being processed, obtained as described above, are stored in their respective slots (step 505 above).

【００６６】When the heading, physical position, logical position, and bibliographic information of the current word have been stored in the life table unit 200 for that word, a check is made for words that have not yet been processed (step 506). If an unprocessed word remains, it is set as the processing target and the processing from step 502 onward is repeated. If none remains (that is, when all words have been processed), the calculation target recognition means 21 ends its operation.
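
The loop of steps 501 to 506 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the input is assumed to be already split into sentences and words (the morphological analysis of step 501 is outside the sketch), the life table is modeled as a plain dictionary keyed by heading, the bibliographic information slot is omitted, and the logical position uses the form implied by equation (2). The name `build_life_table` is hypothetical.

```python
from collections import defaultdict


def build_life_table(sentences):
    """Build a simplified life table from pre-tokenized sentences.

    sentences: list of sentences, each a list of word strings.
    Returns {heading: [{"physical": (m, n), "logical": percent}, ...]},
    one row per occurrence, in order of appearance.
    """
    total = len(sentences)
    table = defaultdict(list)
    for m, words in enumerate(sentences, start=1):
        for n, word in enumerate(words, start=1):
            # Logical position: document length normalized to 100 (%).
            logical = ((m - 1) + n / len(words)) * 100.0 / total
            table[word].append({"physical": (m, n), "logical": logical})
    return dict(table)


table = build_life_table([["a", "b"], ["b", "c"]])
print([row["physical"] for row in table["b"]])  # [(1, 2), (2, 1)]
```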

【００６７】FIG. 9 is a flowchart showing the operation of the potential calculation means 22, which, as described above, is activated when the calculation target recognition means 21 finishes its operation.

【００６８】When the potential calculation means 22 starts processing, it first selects an arbitrary life table unit (200) of the life table 3 and reads the logical position from the logical position slot (231) of the unit's first row (step 601).

【００６９】Next, the logical position of the same word in the row immediately before the row being processed (the row selected in step 601, or the row selected in step 606 or step 609 described later) is read from the corresponding logical position slot of the life table unit (step 602). If the row being processed is the first row for that word in the life table unit (the row selected in step 601 or step 609), no preceding row exists, so the preceding logical position is taken to be 0% (the value indicating the beginning of the input document 1).

【００７０】Next, the logical position of the same word in the row immediately after the row being processed is read from the corresponding logical position slot (step 603). If the row being processed is the last row in which a logical position of that word is stored (once the life table is complete, this is the row just before the last row of the life table unit; the row of heading slot "213" in FIG. 5), no following row exists, so the following logical position is taken to be 100% (the value indicating the end of the input document 1).

【００７１】From the logical positions of the row being processed, the preceding row, and the following row obtained in this way, the potential of the word in the row being processed is calculated, a potential slot (25i) is generated in that row, and the calculated potential is stored in it (step 604).

【００７２】In this embodiment, the potential of a word is defined as the moving average of the distance between the word being processed and the immediately preceding occurrence of the same word (hereinafter, the left-side distance) and the distance between the word being processed and the immediately following occurrence of the same word (hereinafter, the right-side distance); that is, the average of the lengths of the two intervals defined by three consecutive points, or half the length of the path through the three points.

【００７３】That is, given the sequence of logical positions {D_1, D_2, ..., D_F} of the occurrences of the word having the same heading as the word being processed (F is the frequency of occurrence of that word), the potential (i) of the word at the i-th logical position in this sequence can be defined as in equation (3), where i is any of 1 to F.

【００７４】
$$\begin{aligned}
\text{potential}(i) &= \frac{\text{left-side distance}(i) + \text{right-side distance}(i)}{2} \\
\text{left-side distance}(i) &= D_i - D_{i-1} \quad (\text{where } D_0 = 0) \\
\text{right-side distance}(i) &= D_{i+1} - D_i \quad (\text{where } D_{F+1} = 100)
\end{aligned} \qquad (3)$$
After the potential of the word in the row being processed has been determined in this way, it is determined whether any occurrence of the same word remains uncalculated (step 605). If one remains, the row following the current row in the life table unit becomes the new row to be processed, its logical position is read from the logical position slot (step 606), and processing returns to step 602 above.
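
Equation (3), together with the boundary values D_0 = 0 and D_{F+1} = 100 from steps 602 and 603, can be sketched in Python as below. The function name `potentials` and the list-based representation are assumptions for illustration; the patent stores these values in life table slots.

```python
def potentials(positions, start=0.0, end=100.0):
    """Potential of each occurrence of one headword, per equation (3).

    positions: logical positions D_1..D_F in order of appearance.
    start, end: boundary values D_0 and D_{F+1}.
    """
    padded = [start] + list(positions) + [end]
    result = []
    for i in range(1, len(padded) - 1):
        left = padded[i] - padded[i - 1]    # left-side distance
        right = padded[i + 1] - padded[i]   # right-side distance
        result.append((left + right) / 2)
    return result


print(potentials([10, 20, 60]))  # [10.0, 25.0, 40.0]
```

A word that occurs only once gets a potential of 50 regardless of where it appears, since its left-side and right-side distances always add up to 100.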

【００７５】By repeatedly executing the processing loop of steps 602 to 606, a potential is obtained and stored for every occurrence of the word under the heading of the life table unit currently being processed, until step 605 eventually finds that no unprocessed occurrence remains.

【００７６】At that point, that is, when no uncalculated occurrence of the same word remains, a new row is added at the end of the life table unit being processed, and a heading slot, a frequency slot, a potential distribution slot, a life slot, and a gap slot (261 to 265) are generated in it. The heading and the number of occurrences of the word (its frequency) are stored in the corresponding slots (261, 262) (step 607). The potential distribution slot (263), the life slot (264), and the gap slot (265) generated here are filled in by the life estimation means 23, as described later.

【００７７】When the potential calculation means 22 has finished processing one life table unit in this way, it determines whether the life table 3 still contains a life table unit for which the potential calculation has not been performed (step 608).

【００７８】If an unprocessed life table unit exists, one such unit is selected as the life table unit to be processed, the logical position is read from the logical position slot (231) of its first row, and processing moves to step 602 above (step 609). The potential for each occurrence of the word under that unit's heading is then calculated and stored in the same manner as described above.

【００７９】When the potential calculation has been completed for all life table units (200) in the life table 3 and step 608 finds that no unprocessed life table unit remains, the potential calculation means 22 ends its series of operations.

【００８０】FIG. 10 is a flowchart showing the operation of the life estimation means 23, which, as described above, is activated when the potential calculation means 22 finishes its operation.

【００８１】When the life estimation means 23 starts operating, it first selects an arbitrary life table unit (200) of the life table 3 (step 701).

【００８２】Next, the potentials are read from the potential slots (251 to 253) of each row of the life table unit being processed, and the sum of the potentials and the sum of their squares over all occurrences of the word are obtained (step 702). The frequency is read from the frequency slot (262) of the life table unit (step 703). The sum of the potentials is then divided by the frequency to give the average potential of the word for that life table unit (step 704). Finally, the standard deviation of the potential of the word being processed is calculated from the sum of squares, the frequency, and the average potential (step 705).

【００８３】Given the sequence of potentials {P(M)_1, P(M)_2, ..., P(M)_{F(M)}} of the word having the heading M, the standard deviation (M) can be expressed by equation (4), where F(M) is the frequency of occurrence of the word with that heading.
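
Equation (4) is not reproduced in this text. Since step 705 computes the standard deviation from the sum of squares, the frequency, and the average potential, one plausible form is the population standard deviation shown below. The divisor F(M) (rather than F(M) − 1) and the bar notation for the average potential are assumptions made here, not taken from the original.

```latex
\bar{P}(M) = \frac{1}{F(M)} \sum_{i=1}^{F(M)} P(M)_i ,
\qquad
\sigma(M) = \sqrt{ \frac{1}{F(M)} \sum_{i=1}^{F(M)} P(M)_i^{2} - \bar{P}(M)^{2} }
\qquad (4)
```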

【００８４】[0084]

【数２】
When the average potential and standard deviation of the word for the life table unit being processed have been obtained in this way, the pair of calculated values (average potential and standard deviation) is stored in the potential distribution slot (263) of that life table unit (step 706).
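
Steps 702 to 706 can be sketched as follows. The sketch assumes the population form of the standard deviation (dividing by the frequency), which the text does not confirm; the function name `potential_distribution` is hypothetical.

```python
import math


def potential_distribution(potentials):
    """Return (average potential, standard deviation) for one headword.

    Mirrors steps 702-705: sum and sum of squares first, then the mean,
    then the standard deviation from those three quantities.
    Assumes the population standard deviation (divisor = frequency).
    """
    freq = len(potentials)
    if freq == 0:
        raise ValueError("a headword must occur at least once")
    total = sum(potentials)
    total_sq = sum(p * p for p in potentials)
    mean = total / freq
    # Guard against a tiny negative value caused by rounding.
    variance = max(total_sq / freq - mean * mean, 0.0)
    return mean, math.sqrt(variance)


mean, std = potential_distribution([10.0, 25.0, 40.0])
print(mean, round(std, 3))  # 25.0 12.247
```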

【００８５】Next, the life of the word having that heading is estimated from the potential of each occurrence, the frequency of the word, the average potential, and the standard deviation, and is stored in the life slot (264) of the life table unit (step 707).

【００８６】The life (M) of the word having the heading M is estimated by calculating equation (5) below.
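
Equation (5) is not reproduced in this text. Based on the verbal description that accompanies it (the life is the sum of the potentials that are smaller than the average potential plus one standard deviation), it can be written as below. The notation for the average and the standard deviation is assumed here, and whether the comparison is strict is not settled by the text.

```latex
\text{life}(M) = \sum_{\substack{1 \le i \le F(M) \\ P(M)_i < \bar{P}(M) + \sigma(M)}} P(M)_i
\qquad (5)
```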

【００８７】[0087]

【数３】
That is, an occurrence whose potential is smaller than the average potential plus one standard deviation is regarded as having appeared within the same context or topic, and the life is estimated as the sum of the potentials of the occurrences whose potentials deviate little from the average in this sense. Because potentials tend to be distributed toward small values, no threshold is set at the average potential minus one standard deviation.

【００８８】Here, one standard deviation is used as the characteristic value for discriminating whether an occurrence lies within the same context or topic, but another value may be used. For example, 1.5 or 2 standard deviations may be used, or, working with quartiles instead, the "life" may be defined as the sum of the potentials at or below the third quartile.
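
The life estimate and the choice of threshold multiplier can be sketched as follows. The function name `estimate_life` and the parameter `k` (number of standard deviations) are assumptions for illustration, the standard deviation is again assumed to be the population form, and the strict comparison follows the wording "smaller than".

```python
import math


def estimate_life(potentials, k=1.0):
    """Sum of the potentials below (average + k * standard deviation).

    k = 1.0 corresponds to the embodiment; 1.5 or 2.0 are the
    alternatives mentioned in the text.
    """
    if not potentials:
        return 0.0
    freq = len(potentials)
    mean = sum(potentials) / freq
    variance = max(sum(p * p for p in potentials) / freq - mean * mean, 0.0)
    threshold = mean + k * math.sqrt(variance)
    return sum(p for p in potentials if p < threshold)


print(estimate_life([10.0, 25.0, 40.0]))         # 35.0
print(estimate_life([10.0, 25.0, 40.0], k=2.0))  # 75.0
```

With k = 1 the occurrence whose potential is 40 is treated as lying outside the current topic and is excluded; with k = 2 it is included.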

【００８９】Next, the gaps of the document 1 are derived from the logical positions of the occurrences whose potential does not fall within the range up to the sum of the average potential and the standard deviation, and are stored in the gap slot (265) of the life table unit (step 708). This completes the life table unit.

【００９０】As described above, the potential of a word is the average of its left-side distance and right-side distance. When a potential exceeds the sum of the average potential and the standard deviation, a check is made as to which of the left-side and right-side distances contributes more, and a gap, meaning a turning point in the context of the document, is regarded as lying in the direction of the larger contribution. In other words, for an occurrence whose potential deviates greatly from the average potential (it exceeds the sum of the average potential and the standard deviation), the left-side and right-side distances are compared, and the pair consisting of the direction with the larger distance and the logical position of that word is taken as a "gap". Because more than one occurrence may deviate greatly from the average potential, the gap slot may hold more than one gap.
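
The gap derivation of step 708 can be sketched as follows. Several details are assumptions: the function name `find_gaps`, the representation of a gap as a `(direction, logical_position)` tuple, the population standard deviation, and the handling of a tie between the two distances (treated here as "right"), which the text does not specify.

```python
import math


def find_gaps(positions, start=0.0, end=100.0):
    """Return [(direction, logical_position), ...] for one headword.

    An occurrence yields a gap when its potential exceeds the average
    potential plus one standard deviation; the direction is the side
    with the larger distance.
    """
    if not positions:
        return []
    padded = [start] + list(positions) + [end]
    lefts = [padded[i] - padded[i - 1] for i in range(1, len(padded) - 1)]
    rights = [padded[i + 1] - padded[i] for i in range(1, len(padded) - 1)]
    pots = [(l + r) / 2 for l, r in zip(lefts, rights)]
    mean = sum(pots) / len(pots)
    variance = max(sum(p * p for p in pots) / len(pots) - mean * mean, 0.0)
    threshold = mean + math.sqrt(variance)
    gaps = []
    for pos, left, right, pot in zip(positions, lefts, rights, pots):
        if pot > threshold:
            gaps.append(("left" if left > right else "right", pos))
    return gaps


# Three close occurrences followed by one far away: the last one is
# separated from the cluster by a gap on its left.
print(find_gaps([5, 10, 15, 70]))  # [('left', 70)]
```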

【００９１】When the average potential, standard deviation, life, and gaps for the headword of the life table unit being processed have been calculated and stored, it is determined whether a life table unit for another, unprocessed headword exists (step 709). If so, one unprocessed life table unit is selected (step 710), and the processing of steps 702 to 709 above is performed.

【００９２】By repeating this processing loop of steps 702 to 710, the average potential, standard deviation, life, and gaps are calculated and stored for all life table units (headwords). Eventually a negative result is obtained in step 709, and the life estimation means 23 ends its operation.

【００９３】FIG. 11 is a flowchart showing the operation of the document division filter 4, which is activated after the life estimation means 23 has finished processing.

【００９４】When the document division filter 4 starts operating, it first obtains the input document 1 to be divided from the life calculator 2 (step 801) and sets the precision of the document division (step 802).

【００９５】The division precision is given by the user through, for example, the input/output interface 7. When no information is supplied from outside, a predetermined precision (default precision) is set. For example, the precision may specify that even the most finely divided part must span at least one paragraph, or must consist of at least a given number of sentences. When the division is hierarchical, the precision is set differently for each level of the hierarchy.

【００９６】Next, the gaps of the headword are read from an unprocessed life table unit of the life table 3 (step 803), the gaps are classified using the logical position as the key, and their number is accumulated for each class of logical position (step 804). For example, each sentence of the input document 1 is treated as a class of logical position, and the class (sentence) to which the logical position of each gap belongs is counted separately for each direction.

【００９７】This processing is repeated, while checking whether an unprocessed life table unit exists (step 805), until no unprocessed life table unit remains. At that point, the classes of logical position (for example, sentences) with large accumulated counts indicate the positions at which the document 1 can be divided.

【００９８】Accordingly, when no unprocessed life table unit remains, the accumulated gap counts are constrained by the division precision, and the input document 1 is divided into document segments 61 and output (step 806).
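
The text leaves open how the accumulated gap counts are turned into split points. The following is one possible reading, offered only as a sketch: gaps are assumed to be already mapped to sentence numbers, a "left" gap at sentence s is taken to vote for a boundary before s and a "right" gap for a boundary after s, and the division precision is modeled by two hypothetical parameters, `min_votes` and `min_sentences`. None of these names or rules come from the original.

```python
from collections import Counter


def split_document(gaps, total_sentences, min_votes=2, min_sentences=1):
    """Return segments as (first_sentence, last_sentence) pairs (1-based).

    gaps: iterable of (direction, sentence_number) collected from all
    life table units. Boundary b lies between sentences b and b + 1.
    """
    votes = Counter()
    for direction, sentence in gaps:
        boundary = sentence - 1 if direction == "left" else sentence
        if 0 < boundary < total_sentences:
            votes[boundary] += 1

    chosen = []
    for boundary, count in votes.most_common():
        if count < min_votes:
            break
        # Keep every resulting segment at least min_sentences long.
        edges = [0] + chosen + [total_sentences]
        if all(abs(boundary - edge) >= min_sentences for edge in edges):
            chosen.append(boundary)

    edges = [0] + sorted(chosen) + [total_sentences]
    return [(edges[i] + 1, edges[i + 1]) for i in range(len(edges) - 1)]


gaps = [("left", 4), ("right", 3), ("right", 7),
        ("left", 8), ("left", 8), ("right", 5)]
print(split_document(gaps, total_sentences=10))  # [(1, 3), (4, 7), (8, 10)]
```

Raising `min_votes` or `min_sentences` gives a coarser division, which is one way to express the user-supplied precision of step 802.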

【００９９】After that, a node corresponding to each document segment 61 (see FIG. 6) is generated in the document index 62, and the identifier, start position, and end position of the document segment 61 are stored in the node's identifier slot (302), start position slot (303), and end position slot (304) (step 807). The document division filter 4 then ends its operation.

【０１００】FIG. 12 is a flowchart showing the operation of the characteristic word extraction filter 5. The characteristic word extraction filter 5 is activated after the life estimation means 23 has finished processing, and, except for the processing of step 905, it can run in parallel with the document division filter 4.

【０１０１】When the characteristic word extraction filter 5 starts operating, it first sets the number of characteristic word candidates to be output as the result (step 901). The number of candidates is given by the user through, for example, the input/output interface 7. When the user supplies no information, a predetermined number of candidates (default number of candidates) is set.

【０１０２】Next, the life of the headword is read from the life slot of an unprocessed life table unit of the life table 3 (step 902), and all headwords whose lives have been read so far, including the current one, are sorted in descending order of life, using the life as the key (step 903). This processing is repeated, while checking whether an unprocessed life table unit exists (step 805), until no unprocessed life table unit remains.

【０１０３】When this repetition leaves no unprocessed life table unit, the sorted headwords are limited to the set number of candidates, those headwords are taken as the characteristic words, and they are stored in the characteristic word slot (305) of each node of the document index 62 (step 905). The characteristic word extraction filter 5 then ends its operation.
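
The ranking performed in steps 901 to 905 amounts to sorting headwords by life and keeping the top candidates. A minimal sketch, in which the function name `extract_characteristic_words`, the dictionary input, and the default of five candidates are assumptions for illustration:

```python
def extract_characteristic_words(lives, max_candidates=5):
    """Return the headwords with the longest lives, longest first.

    lives: {headword: life} as read from the life slots.
    max_candidates: number of candidates set in step 901.
    """
    ranked = sorted(lives.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:max_candidates]]


lives = {"stock": 42.0, "the": 3.5, "investigation": 30.0}
print(extract_characteristic_words(lives, max_candidates=2))
# ['stock', 'investigation']
```

The sketch sorts once at the end rather than re-sorting after each life table unit as the flowchart describes; the resulting order is the same.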

【０１０４】For example, the gaps are read from the life table unit of a headword selected as a characteristic word, the gap positions are compared with the start and end positions of the node for each document segment 61 to identify the document segments 61 that contain the characteristic word, and the information on the characteristic word is stored in the characteristic word slot (305) of the nodes for the identified document segments 61.
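
The matching of characteristic words to document segments can be sketched as a simple range test. The data shapes are assumptions: segments are given as `{segment_id: (start, end)}` in logical positions, and each characteristic word comes with the list of logical positions taken from its gaps. The function name `assign_characteristic_words` is hypothetical.

```python
def assign_characteristic_words(segments, word_positions):
    """Map each segment to the characteristic words located inside it.

    segments: {segment_id: (start, end)} with inclusive logical positions.
    word_positions: {word: [logical_position, ...]}.
    """
    result = {segment_id: [] for segment_id in segments}
    for word, positions in word_positions.items():
        for segment_id, (start, end) in segments.items():
            if any(start <= pos <= end for pos in positions):
                result[segment_id].append(word)
    return result


segments = {"s1": (0.0, 40.0), "s2": (40.1, 100.0)}
word_positions = {"stock": [12.0, 70.0], "bribe": [85.0]}
print(assign_characteristic_words(segments, word_positions))
# {'s1': ['stock'], 's2': ['stock', 'bribe']}
```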

【０１０５】As described above, according to this embodiment, a means for obtaining the life of each word in a document is provided and the document is divided by observing the distribution of word lives, so the places where the topic of the document changes can be identified easily and the document can be divided automatically and appropriately. Consequently, even when the input document is a compound document, it can be divided in accordance with its context and topics.

【０１０６】Furthermore, according to this embodiment, documents can be classified with high accuracy without a person having to prepare in advance the enormous amount of data needed to match documents against classification categories.

【０１０７】Moreover, according to this embodiment, a means for obtaining the life of each word in a document is provided and characteristic words are extracted on the basis of word life, so characteristic words that faithfully reflect the intent and context of the input document can be extracted accurately and automatically. It is therefore no longer necessary to assign keywords to documents by hand.

【０１０８】In the above embodiment, the document processing apparatus of the present invention is applied to an apparatus intended for document classification, but its technical idea may be applied to other document processing apparatuses. For example, a document division apparatus may be configured by applying only the technical idea for dividing a document, and a characteristic word extraction apparatus may be configured by applying only the technical idea for extracting characteristic words from a document. Such a characteristic word extraction apparatus may further be provided with a functional unit that defines a context vector whose elements are characteristic words and searches for similar sentences.

【０１０９】In the above embodiment, life table units (lives and so on) are formed for all headwords contained in the document, but they may instead be formed only for independent words such as nouns and verbs. Also, although characteristic words have been described as words in the ordinary sense, such as nouns, taking them in the form of phrases or clauses makes it possible to apply the invention to, for example, an apparatus that summarizes documents.

【０１１０】Needless to say, the defining equations for potential and life are not limited to those of the above embodiment. For example, for a word whose bibliographic information records "used in the title", a predetermined offset may be added to the potential given by the above formula. In addition, only words whose frequency of occurrence is at least a predetermined number may be subjected to the life calculation.

[0111] Furthermore, in the above embodiment, characteristic words are obtained once for the input document as a whole and then associated with each divided document. Alternatively, for each divided document, the lives of its constituent words may be sorted and the characteristic words extracted on that basis.
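The alternative described here, ranking words by life within each divided document, could look like the sketch below. The selection rule (keep the `k` words with the longest life), the alphabetical tie-break, and the names `characteristic_words_per_segment`, `segments`, and `life` are assumptions made for illustration; this paragraph does not restate the embodiment's own extraction criterion.

```python
def characteristic_words_per_segment(segments, life, k=3):
    """For each divided document, sort its distinct words by life and keep the top k.

    segments: list of word lists, one list per divided document.
    life: dict mapping word -> life value; words missing from it are skipped.
    """
    result = []
    for words in segments:
        candidates = {w for w in words if w in life}
        # Longest life first; ties are broken alphabetically for a stable result.
        ranked = sorted(candidates, key=lambda w: (-life[w], w))
        result.append(ranked[:k])
    return result
```

Because the sort runs per divided document, a word that dominates the whole input document does not crowd out words that matter only within one segment.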

【０１１２】[0112]

[Effects of the Invention] The document processing apparatus according to the first aspect of the present invention comprises: a word life calculator that, based on the appearance positions of an arbitrary word in the input document, obtains life information indicating the span over which the word exerts its asserting function in the document; word life information storage means for storing the word life information obtained by the word life calculator; and document dividing means for dividing the input document by referring to the word life information stored in the word life information storage means. Because the input document is divided by referring to the word life information, it can be divided automatically in accordance with changes of context or topic, with almost no information having to be prepared in advance.

[0113] The document processing apparatus according to the second aspect of the present invention comprises: a word life calculator that, based on the appearance positions of an arbitrary word in the input document, obtains life information indicating the span over which the word exerts its asserting function in the document; word life information storage means for storing the word life information obtained by the word life calculator; and characteristic word extracting means for extracting characteristic words that serve as an index of the input document by referring to the word life information stored in the word life information storage means. Because the characteristic words in the input document are extracted by referring to the word life information, a highly convenient apparatus can be provided that extracts characteristic words automatically, without the user having to specify them.

[0114] The document processing apparatus according to the third aspect of the present invention comprises: a word life calculator that, based on the appearance positions of an arbitrary word in the input document, obtains life information indicating the span over which the word exerts its asserting function in the document; word life information storage means for storing the word life information obtained by the word life calculator; document dividing means for dividing the input document by referring to the word life information stored in the word life information storage means; and characteristic word extracting means for extracting characteristic words that serve as an index of the input document and/or the divided documents by referring to the word life information stored in the word life information storage means. The effects of the apparatuses of both the first and second aspects can therefore be obtained together.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of the document processing apparatus of the embodiment.

FIG. 2 is a block diagram showing a system configuration example of the document processing apparatus.

FIG. 3 is an example of a memory table used in a conventional document processing apparatus.

FIG. 4 is an example of a memory table used in a conventional document processing apparatus.

FIG. 5 is an explanatory diagram showing a configuration example of the life table unit of the embodiment.

FIG. 6 is an explanatory diagram showing the storage format of the document index of the embodiment.

FIG. 7 is an operation flowchart of the life calculator of the embodiment.

FIG. 8 is an operation flowchart of the calculation target recognition means of the embodiment.

FIG. 9 is an operation flowchart of the potential calculation means of the embodiment.

FIG. 10 is an operation flowchart of the life estimation means of the embodiment.

FIG. 11 is an operation flowchart of the document dividing filter of the embodiment.

FIG. 12 is an operation flowchart of the characteristic word extraction filter of the embodiment.

[Explanation of symbols]

1 ... input document, 2 ... life calculator, 3 ... life table, 4 ... document dividing filter, 5 ... characteristic word extraction filter, 6 ... output result (storage unit), 21 ... calculation target recognition means, 22 ... potential calculation means, 23 ... life estimation means, 61 ... document segment (storage unit), 62 ... document index (storage unit).

Claims

[Claims]

1. A document processing apparatus comprising: input means for inputting a document described in natural language; a word life calculator that, based on the appearance positions of an arbitrary word in the input document, obtains life information indicating the span over which the word exerts its asserting function in the document; word life information storage means for storing the word life information obtained by the word life calculator; and document dividing means for dividing the input document by referring to the word life information stored in the word life information storage means.

2. A document processing apparatus comprising: input means for inputting a document described in natural language; a word life calculator that, based on the appearance positions of an arbitrary word in the input document, obtains life information indicating the span over which the word exerts its asserting function in the document; word life information storage means for storing the word life information obtained by the word life calculator; and characteristic word extracting means for extracting characteristic words that serve as an index of the input document by referring to the word life information stored in the word life information storage means.

3. A document processing apparatus comprising: input means for inputting a document described in natural language; a word life calculator that, based on the appearance positions of an arbitrary word in the input document, obtains life information indicating the span over which the word exerts its asserting function in the document; word life information storage means for storing the word life information obtained by the word life calculator; document dividing means for dividing the input document by referring to the word life information stored in the word life information storage means; and characteristic word extracting means for extracting characteristic words that serve as an index of the input document and/or the divided documents by referring to the word life information stored in the word life information storage means.

4. The document processing apparatus according to claim 1, wherein the word life calculator comprises: word position recognition means for recognizing the appearance positions of a target word; potential calculation means for calculating the distance between the appearance position of the word being processed and that of the adjacent occurrence of the same word, and obtaining a potential representing the range over which the word being processed exerts its influence; and life estimation means for obtaining the life information of the word based on the potential distribution information of the same word obtained by the potential calculation means.