JPS6225324A

JPS6225324A - Document understanding system

Info

Publication number: JPS6225324A
Application number: JP60164113A
Authority: JP
Inventors: Yasuaki Nakano; 中野　康明; Koji Yokoyama; 横山　晃二; Shoichi Nakagami; 昇一中上; Junichi Tono; 東野　純一; Hiromichi Fujisawa; 浩道藤澤
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1985-07-26
Filing date: 1985-07-26
Publication date: 1987-02-03

Abstract

PURPOSE: To extract bibliographic items automatically by dividing a fixed-form document according to its structure. CONSTITUTION: A grammar that expresses the structure of a document is used, and a description written in this grammar is parsed to grasp the structure of an unknown input document. The grammar expresses a document input as a character sequence as a set of rectangular areas, and contains as variables the quantities indicating the absolute or relative sizes of the rectangular areas and the absolute or relative relations among them. A search method for the rectangular areas can also be specified. Furthermore, a rectangular area is expressed as a set of search areas, and this hierarchical expression allows the format of the document to be described in fine detail. The formats of various documents expressed according to the grammar are stored in a memory in advance. When a character sequence is input as an unknown document, a syntax analysis part searches for rectangular areas by the search method specified in the document format and thereby analyzes the structure of the document.

Description

[Detailed Description of the Invention]

[Field of Application of the Invention]

The present invention relates to a document understanding method, and in particular to a document understanding method suitable as the input section of an electronic document filing device.

[Background of the Invention]

Conventional electronic document filing devices merely store the content of each page of a document as a character sequence. The bibliographic items needed as secondary information for retrieval (document name, date of issue, document number, and so on) had to be supplied from outside through a code input means such as a keyboard. To save labor in file creation, however, it is desirable to understand automatically the title, author, and similar items written in the document and to generate the bibliographic data automatically. Making retrieval more sophisticated further requires automatic detection of figure and table captions and of chapter and section titles, or automatic keyword extraction based on understanding the body text itself. It has also become common for the character sequence of a document to contain graphics commands, and dividing the character sequence of the target document into parts such as title, author, abstract, body text, and graphics commands has been called for in order to diversify retrieval.

In the prior art, methods have been proposed for word processors and the like in which a heading sentence is entered separately to make document retrieval easier. One example of such a method is disclosed in the "Hitachi Japanese Word Processor WordPal 25 Operating Instructions (Ji)", page J3s (Hitachi, Ltd., first edition issued July 1, 1984). This method, however, does not specify the bibliographic items themselves; it merely enters a short note that serves as a reminder for the user. To use bibliographic items for full-fledged retrieval, an indexer still had to supply them separately.

[Purpose of the invention]

The object of the present invention is to provide a document understanding method that targets general documents with a fixed format and enables automatic extraction of bibliographic items by dividing each document according to its structure.

[Summary of the invention]

To achieve this object, the present invention uses a grammar that expresses the structure of a document, and grasps the structure of an unknown input document by parsing a description written in this grammar.

In this grammar, a document given as a character sequence is expressed as a set of rectangular areas, and the quantities representing the absolute or relative sizes of the rectangular areas and the absolute or relative relations between them are included as variables. A search method for the rectangular areas can also be specified. Furthermore, a rectangular area is expressed as a set of search areas, and this hierarchical representation allows the format of a document to be expressed down to its details.

For each kind of document, a document format expressed according to the above grammar is stored in memory in advance. When a character sequence representing an unknown document is input, the syntax analysis unit searches for rectangular areas according to the search method specified in the document format. It extracts information on whether the search succeeded together with the numerical values of the parameters determined during the search (the absolute or relative sizes of the rectangular areas and the absolute or relative relations between them). The syntax analysis unit assigns these parameter values to the variables in the document format and carries out the next analysis, and in this way proceeds step by step with the structural analysis of the document.

Before the embodiments are described, the concept underlying the present invention is explained.

FIG. 1 shows an example of part of a technical paper that has a fixed format. This page was generated from a character code sequence and from format information indicating at what position on the paper and in what form those characters are to be printed. The following explanation takes a technical paper as its example, but other documents differ only slightly in the form of the grammar. The present invention can be applied to them by changing part of the grammar, and it is not limited to the technical paper used as an example here.

Next, an example of a grammar that expresses the structure of a document (hereinafter abbreviated as the document grammar) is shown.

    (defform F (form F1 (10 90 10 40))
               (form F2 ...)
               (form F3 ...))
    (defform F1 (form F11 (0 100 10 50) string '論文)
                (form F12 (?Xmin ?Xmax ?Ymin ?Ymax) shrink))
    (defmac LINE-1 (%1)                       ; first line
      (point ?Y1 (mode IN Y LESS))
      (point ?Y2 (mode OUT Y LESS))
      (form %1 (0 ?W ?Y1 ?Y2)))
    (defmac LINE-2 (%1)                       ; second line
      (point ?Y2 (mode OUT Y LESS))
      (point ?Y3 (mode IN Y LESS) (area (0 ?W ?Y2 ?H)))
      (point ?Y4 (mode OUT Y LESS) (area (0 ?W ?Y3 ?H)))
      (form %1 (0 ?W ?Y3 ?Y4)))

The above grammar is explained below with reference to the example of FIG. 1.
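Because the document grammar is written as nested parenthesized lists, a format description can be read into a tree before any analysis starts. The following is only a minimal sketch of such a reader in Python. The function names `tokenize` and `parse`, and the choice of nested Python lists as the internal form, are assumptions made for illustration and are not taken from the patent.

```python
import re

def tokenize(text):
    """Split grammar text into parentheses and atoms, dropping ';' comments."""
    text = re.sub(r";[^\n]*", "", text)
    return re.findall(r"[()]|[^\s()]+", text)

def parse(tokens):
    """Turn a token list into nested Python lists, one list per parenthesized form."""
    def read(pos):
        items = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == "(":
                sub, pos = read(pos + 1)
                items.append(sub)
            elif tok == ")":
                return items, pos + 1
            else:
                # Plain numbers become ints; everything else stays a string.
                items.append(int(tok) if tok.isdigit() else tok)
                pos += 1
        return items, pos
    tree, _ = read(0)
    return tree

GRAMMAR = """
(defform F1 (form F11 (0 100 10 50) string '論文)   ; upper part
            (form F12 (?Xmin ?Xmax ?Ymin ?Ymax) shrink))
"""
forms = parse(tokenize(GRAMMAR))
```

Here `forms[0]` is the whole `defform F1` definition, with each `form` clause available as a sub-list whose third element holds either fixed numbers or `?` variables.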

The first definition, defform F ..., indicates that format F consists, as shown in FIG. 2, of format F1 with formats F2 and F3 arranged side by side beneath it. In FIG. 1 the portions corresponding to F, F1, F2, and F3 of FIG. 2 are enclosed in broken lines. The four numbers 10 90 10 40 in parentheses after the format name F1 give the relative size of F1 when the whole area F is taken as 100 × 100. When parameter values are known in this way, they can simply be written directly.
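The relative notation can be illustrated by converting such a box to absolute page units. The sketch below assumes that the four numbers are ordered (Xmin, Xmax, Ymin, Ymax), as the variable names used for F12 suggest; the function name `to_absolute` and the 80 × 60 page size are hypothetical.

```python
def to_absolute(rel_box, parent_box):
    """Map a box given on the parent's 100 x 100 grid to absolute units.

    Both boxes are (xmin, xmax, ymin, ymax); this ordering is an assumption.
    """
    xmin, xmax, ymin, ymax = rel_box
    pxmin, pxmax, pymin, pymax = parent_box
    width = pxmax - pxmin
    height = pymax - pymin
    return (pxmin + xmin * width / 100, pxmin + xmax * width / 100,
            pymin + ymin * height / 100, pymin + ymax * height / 100)

page = (0, 80, 0, 60)                 # hypothetical page: 80 columns, 60 rows
f1 = to_absolute((10, 90, 10, 40), page)
```

With these assumed numbers, F1 covers columns 8 to 72 and rows 6 to 24 of the page.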

The next definition, defform F1 ..., indicates that format F1 in turn consists of formats F11 and F12 arranged vertically. The clause string '論文 in F11 indicates that the character data "論文" ("paper") is present in F11. The string specification is optional; when it is omitted, the presence of character data is not checked. The names prefixed with ? in F12 are variables, whose values vary from one target document to another.

The four variables (?Xmin ?Xmax ?Ymin ?Ymax) give the relative size of F12; as described later, the values found by the search are assigned to these variables.

shrink indicates that the rectangular area corresponding to the format is reduced until it circumscribes the character components. The shrink specification is also optional.
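One way to picture the shrink operation is as the bounding box of the characters that fall inside the area. The sketch below assumes that characters are given as (x, y) positions and that an area is a tuple (xmin, xmax, ymin, ymax); both representations and the function name are illustrative choices, not details stated in the patent.

```python
def shrink(box, chars):
    """Shrink box to the bounding box of the characters inside it.

    box is (xmin, xmax, ymin, ymax); chars is a list of (x, y) positions.
    Returns None when the box contains no character at all.
    """
    xmin, xmax, ymin, ymax = box
    inside = [(x, y) for x, y in chars
              if xmin <= x <= xmax and ymin <= y <= ymax]
    if not inside:
        return None
    xs = [x for x, _ in inside]
    ys = [y for _, y in inside]
    return (min(xs), max(xs), min(ys), max(ys))

chars = [(12, 5), (20, 5), (15, 7), (95, 50)]
tight = shrink((10, 90, 0, 10), chars)
```

The character at (95, 50) lies outside the area and therefore does not affect the result.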

The part beginning with defmac LINE-1 (%1) uses a macro definition to simplify the definition of the first line, LINE-1. The format %1 is defined within the macro by (form %1 (0 ?W ?Y1 ?Y2)). ?W denotes the horizontal size of the format and ?H its vertical size. Text following a semicolon (;) is a comment.

point indicates that a point satisfying a given condition is searched for and assigned to a variable. The search condition is specified by mode. IN and OUT indicate whether the point sought is a transition from blank to character or from character to blank, Y indicates the search axis (X or Y), and LESS indicates the search direction. area indicates the region to be searched.
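The point search along the Y axis can be sketched as a scan over text rows. The code below assumes the page is a list of strings, one per row, and that the scan runs from smaller to larger Y; the patent does not spell out what direction LESS denotes, so that choice, like the names `row_has_text` and `find_point`, is an assumption.

```python
def row_has_text(page, y, xmin, xmax):
    """True if row y contains a non-blank character between columns xmin and xmax."""
    return page[y][xmin:xmax + 1].strip() != ""

def find_point(page, mode, area):
    """Find the first row in area where the blank/text state changes.

    mode "IN":  blank -> text (a row with text whose predecessor is blank
                or lies outside the area).
    mode "OUT": text -> blank (a blank row whose predecessor has text).
    area is (xmin, xmax, ymin, ymax). Returns None when no such row exists.
    """
    xmin, xmax, ymin, ymax = area
    for y in range(ymin, ymax + 1):
        current = row_has_text(page, y, xmin, xmax)
        before = y > ymin and row_has_text(page, y - 1, xmin, xmax)
        if mode == "IN" and current and not before:
            return y
        if mode == "OUT" and before and not current:
            return y
    return None

page = ["", "TITLE OF THE PAPER", "", "Author Name", ""]
y1 = find_point(page, "IN", (0, 19, 0, 4))     # start of first line
y2 = find_point(page, "OUT", (0, 19, 0, 4))    # end of first line
y3 = find_point(page, "IN", (0, 19, y2, 4))    # start of second line
y4 = find_point(page, "OUT", (0, 19, y3, 4))   # end of second line
```

The four calls mirror the LINE-1 and LINE-2 macros: each later search restricts its area using a value found by an earlier one.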

In understanding a document, the expression written according to the grammar is consulted, and the rectangular areas described in it are checked one after another to see whether they exist in the document. When a rectangular area described with variables is found, the numerical values of those variables are obtained, and from then on those values are used in place of the variables.

Next, operations between rectangular areas are explained. Areas with non-rectangular shapes also appear in real documents. FIGS. 3(A) and 3(B) are examples of non-rectangular areas, and FIG. 3(C) shows an example in which one rectangular area has been split into two. As the broken lines indicate, the areas of FIGS. 3(A) and 3(B) can be regarded as the sum or the difference of two rectangular areas, respectively.

The case of FIG. 3(C) becomes simple to express if the two rectangular areas are regarded as joined and virtually combined into a single rectangular area. To make such operations between rectangular areas possible, a virtual transfer of an area is defined as follows.

    (map&form F (space ?W ?H)
                (position (?X0 ?Y0) (?Xmin ?Xmax ?Ymin ?Ymax)))

The meaning of this virtual transfer is explained with reference to FIG. 4. space sets up, as the new format F, a rectangular area of width ?W and height ?H, and indicates that the transfer takes place into this area. position gives the coordinates of the upper-left corner of the destination rectangle. The source rectangle designated by the four values (?Xmin ?Xmax ?Ymin ?Ymax) is copied to this destination.
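The virtual transfer can be sketched as copying a source rectangle into a fresh character grid. The code below assumes the page is a list of equal-length strings and that coordinates are inclusive column and row indices; the function name `transfer` and the optional `canvas` argument are illustrative and not part of the grammar.

```python
def transfer(page, src_box, space, position, canvas=None):
    """Copy the characters in src_box of page into a virtual area.

    space is (width, height) of the virtual area, position is the upper-left
    (x0, y0) of the destination, src_box is (xmin, xmax, ymin, ymax).
    Passing an existing canvas lets several transfers share one virtual area.
    """
    width, height = space
    x0, y0 = position
    xmin, xmax, ymin, ymax = src_box
    if canvas is None:
        canvas = [[" "] * width for _ in range(height)]
    for y in range(ymin, ymax + 1):
        for x in range(xmin, xmax + 1):
            ty, tx = y0 + (y - ymin), x0 + (x - xmin)
            if 0 <= ty < height and 0 <= tx < width:
                canvas[ty][tx] = page[y][x]
    return canvas

# Two separated columns are joined into one virtual rectangle.
page = ["AB  CD",
        "EF  GH"]
area = transfer(page, (0, 1, 0, 1), (4, 2), (0, 0))
area = transfer(page, (4, 5, 0, 1), (4, 2), (2, 0), canvas=area)
rows = ["".join(r) for r in area]
```

This corresponds to the situation of FIG. 3(C), where two spatially separated rectangles are treated as one.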

By combining the virtual transfers described above, an area with a complex shape such as those shown in FIG. 3 can be expressed by operations between two or more rectangular areas. For example, FIG. 3(A) can be expressed as two rectangular areas of different sizes transferred so that they adjoin each other.

As the above explanation shows, the document grammar proposed in the present invention treats the structure of a document as a combination of rectangular areas and expresses the relations between those areas in the grammar. This increases its power to describe documents: cases that were previously difficult to handle, such as an area whose number of lines is not fixed or a rectangular area that may or may not appear, can also be described. A wide variety of documents can therefore be analyzed.

[Embodiments of the invention]

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 5 is a block diagram showing the configuration of an apparatus that employs a document processing method according to an embodiment of the present invention.

Each part of the apparatus is connected to bus 1, and the overall operation is controlled by control unit 2. The information (character sequence) on document 3 is input from keyboard 4 as a character code sequence and stored in memory 51 via bus 1.

Memory 51, together with memories 52 to 55 described later, forms part of memory 5. Instead of being obtained from keyboard 4, the character code sequence in memory 51 may be read from a file device such as a magnetic disk. The character code sequence may also be the result of reading character patterns on paper with a character recognition device, and it may contain format control data.

The control unit 2 applies a known text formatting program to the input character sequence, and the resulting formatted character sequence, which corresponds to the image of one page, is stored in memory 52 in the form of matrix data.

The format data of the target document, written according to the grammar described above, is assumed to be stored in memory 53 in advance.

The control unit 2 uses this format data to perform document understanding processing on the formatted character sequence.

Here, document understanding processing means decomposing the formatted character sequence of the page image into a plurality of rectangular areas and classifying each of those areas. Among the areas obtained as the result of document understanding, for each area designated in advance as a retrieval target, the character sequence of that part is registered as retrieval information in the retrieval data memory area 54. The data may be processed and edited when it is registered. The retrieval information of the input document obtained in this way is output to file 6, and the character code sequence of the document is output to file 7. When the character code sequence is output to file 7, the decomposed rectangular areas may be output separately, area by area.

File 6 and file 7 may also be one and the same file.

The document understanding processing is described in detail below. FIGS. 6 and 7 illustrate the flow of the processing, written in PAD (Program Analysis Diagram) form. Step 101 extracts the position coordinates X(i), Y(i) of each character code on the page image. Steps 103, 104, and 105 are, respectively, the initialization, the main body, and the end determination of the parsing process.
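The coordinate extraction of step 101 can be sketched as follows, assuming the formatted page is held as a list of text rows so that the column index serves as X and the row index as Y. The function name `char_positions` and the tuple layout are illustrative assumptions.

```python
def char_positions(page_rows):
    """Return (character, x, y) for every non-blank character on the page.

    x is the column index within the row and y is the row index.
    """
    return [(ch, x, y)
            for y, row in enumerate(page_rows)
            for x, ch in enumerate(row)
            if not ch.isspace()]

positions = char_positions(["a b", " c "])
```

The index of each entry in the returned list can then play the role of the character number i used in the later steps.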

Step 103 copies the format data stored in memory 53 to working memory 55 and initializes the various tables and internal program variables.

The main body 104 of the parsing process consists of steps 106 to 120. Step 106 controls the loop so that steps 107 to 119 are repeated until the end determination is made in step 120.

Step 107 takes a statement out of the format data.

An unprocessed statement is a line that contains a variable whose value has not yet been determined, or whose corresponding document area has not yet been determined. Step 108 is a decision that skips steps 109 to 119 when no unprocessed statement remains.

In that case the end determination is carried out.
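The repetition over unprocessed statements can be sketched as a loop that keeps revisiting statements until none remain. The code below is only an outline under stated assumptions: `process` is a hypothetical callback that returns True when a statement has been fully handled, and the extra stop conditions (no progress, `max_passes`) are safeguards added for the sketch rather than steps described in the patent.

```python
def run_analysis(statements, process, max_passes=100):
    """Process statements repeatedly until none is left unprocessed.

    Returns the list of statements that could not be completed.
    """
    pending = list(statements)
    for _ in range(max_passes):
        if not pending:
            break
        still_pending = [s for s in pending if not process(s)]
        if len(still_pending) == len(pending):
            break                      # no statement made progress
        pending = still_pending
    return pending

# Toy statements: "use" needs a variable that "bind" defines later.
bindings = {}
used = []

def process(stmt):
    if stmt[0] == "bind":
        bindings[stmt[1]] = stmt[2]
        return True
    if stmt[1] in bindings:
        used.append(bindings[stmt[1]])
        return True
    return False                       # variable still undetermined

left = run_analysis([("use", "?Y1"), ("bind", "?Y1", 5)], process)
```

In the first pass the `use` statement stays unprocessed because `?Y1` has no value yet; it completes in the second pass once the value is known.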

If the statement taken out in step 107 is an unprocessed statement, steps 109 to 117 are performed (steps 110 to 117 are shown in FIG. 7). Step 109 determines the type of the statement and branches accordingly; the processing in steps 110 to 117 varies with the statement type. FIGS. 6 and 7 and the following description cover only the form statement, that is, (form F0 (?Xmin ?Xmax ?Ymin ?Ymax) shrink), but other statements likewise receive processing specific to their type.

Steps 110 to 119 process the predicate form. Step 110 checks whether the format name F0 has already been registered and, if not, step 111 registers F0 in the format table. Step 111 also checks whether each character string written in the position of the variable names (?Xmin ?Xmax ?Ymin ?Ymax) is a variable or a number and, for a variable, whether it has already been registered; unregistered variables are registered in the variable table. If a variable has been registered, it is checked whether its value has been determined. If the value has not been determined, the form processing ends (and the statement remains unprocessed). If the value has been determined, the variable name in the statement is rewritten with that value.

As a concrete example, when ?Xmin = 0 and ?Xmax = 90 while ?Ymin and ?Ymax are not yet registered, the statement above is rewritten as (form F0 (0 90 ?Ymin ?Ymax) shrink), and the variables ?Ymin and ?Ymax are registered in the variable table with their values undetermined.
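The rewriting in this example can be sketched with a dictionary standing in for the variable table. Representing a statement as nested lists and marking an undetermined value with `None` are assumptions made for the sketch; the patent does not describe how the tables are stored.

```python
def substitute(statement, variables):
    """Rewrite determined ?variables with their values; register unknown ones.

    variables maps a variable name to its value, with None meaning
    "registered, value not yet determined".
    """
    def walk(item):
        if isinstance(item, list):
            return [walk(sub) for sub in item]
        if isinstance(item, str) and item.startswith("?"):
            value = variables.setdefault(item, None)   # register if new
            return item if value is None else value
        return item
    return walk(statement)

variables = {"?Xmin": 0, "?Xmax": 90}
statement = ["form", "F0", ["?Xmin", "?Xmax", "?Ymin", "?Ymax"], "shrink"]
rewritten = substitute(statement, variables)
```

After the call, `?Ymin` and `?Ymax` are present in the table with no value, so the statement still counts as unprocessed.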

Step 112 branches on whether all variable names in the statement have been rewritten to numerical values; if they all have, the form execution of step 113 is performed. The details of the form execution are given by steps 114 to 118. Step 114 indicates that the following processing is repeated for each character i extracted in step 102. Step 115 compares the X and Y coordinates X(i), Y(i) of character i with the numerical values corresponding to the variables ?Xmin ?Xmax ?Ymin ?Ymax in the statement and determines whether the character satisfies

    ?Xmin < X(i) < ?Xmax
    ?Ymin < Y(i) < ?Ymax

When this condition holds, step 116 registers character i in the component table of F0. When no character satisfies the condition, step 117 sets the analysis failure flag.
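The form execution of steps 114 to 117 can be sketched as a filter over character positions using the strict inequalities stated above. The dictionary used as the component table and the Boolean return value standing in for the analysis failure flag are assumptions for illustration.

```python
def execute_form(name, box, chars, components):
    """Register in components[name] the indices of the characters inside box.

    box is (xmin, xmax, ymin, ymax) with all values already determined;
    chars is a list of (x, y) positions. Returns False (analysis failure)
    when no character satisfies the condition.
    """
    xmin, xmax, ymin, ymax = box
    members = [i for i, (x, y) in enumerate(chars)
               if xmin < x < xmax and ymin < y < ymax]
    if members:
        components[name] = members
    return bool(members)

chars = [(5, 5), (50, 20), (95, 20)]
components = {}
ok = execute_form("F0", (0, 90, 10, 40), chars, components)
failed = not execute_form("F9", (0, 4, 0, 4), chars, components)
```

Only the character at (50, 20) lies strictly inside the first box; the second box contains no character, so the sketch reports a failure for it.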

As explained above, steps 106 to 117 make it possible to detect that a structure corresponding to a form statement in the format data exists in the input character sequence. When a form statement has a string specification, whether the specified character data exists in the input character sequence can be determined by word matching. In word matching, a synonym dictionary may be consulted so that synonyms are treated as identical. Statements other than form can be analyzed in the same way. When a statement contains variables, the parameters obtained during the analysis are assigned to those variables, and the results are used in other statements.
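The string check with a synonym dictionary can be sketched as a substring test over the word and its synonyms. The dictionary layout (a word mapped to a list of synonyms) and the function name `contains_string` are assumptions; the patent only says that synonyms may be treated as identical.

```python
def contains_string(text, word, synonyms=None):
    """True if text contains word or any synonym listed for it."""
    candidates = {word} | set((synonyms or {}).get(word, ()))
    return any(candidate in text for candidate in candidates)

synonyms = {"Paper": ["Report", "Article"]}   # hypothetical dictionary
exact = contains_string("研究論文 第1号", "論文")
by_synonym = contains_string("Technical Report", "Paper", synonyms)
missing = contains_string("Memo", "Paper", synonyms)
```

A real implementation would probably match on word boundaries rather than raw substrings, which this sketch leaves out for brevity.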

Step 118 examines the analysis failure flag and, if the analysis has failed, backtracks and retries. In this case, control returns to a statement that has already been analyzed, the variables to which parameters were assigned are restored to their previous state, and another possibility is explored.
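Backtracking of this kind can be sketched as a depth-first search in which each statement offers candidate bindings and a failure further on causes the next candidate to be tried. In the sketch below the previous state is restored implicitly, because every attempt works on a fresh copy of the bindings; the statement kinds `choose` and `at_least` are toy examples invented for the illustration and do not appear in the patent.

```python
def solve(statements, bindings, try_statement):
    """Analyze statements in order, backtracking when a later one fails.

    try_statement(stmt, bindings) yields candidate dicts of new bindings;
    yielding nothing means the statement fails under the current bindings.
    Returns the final bindings, or None when every possibility fails.
    """
    if not statements:
        return bindings
    first, rest = statements[0], statements[1:]
    for new_bindings in try_statement(first, bindings):
        result = solve(rest, {**bindings, **new_bindings}, try_statement)
        if result is not None:
            return result
    return None

def try_statement(stmt, bindings):
    kind, var, arg = stmt
    if kind == "choose":            # each listed value is one possibility
        for value in arg:
            yield {var: value}
    elif kind == "at_least":        # succeeds only for a large enough value
        if bindings[var] >= arg:
            yield {}

found = solve([("choose", "?Y1", [1, 3]), ("at_least", "?Y1", 2)],
              {}, try_statement)
none_found = solve([("choose", "?Y1", [1, 3]), ("at_least", "?Y1", 5)],
                   {}, try_statement)
```

The first candidate, ?Y1 = 1, fails the later check, so the search returns to the `choose` statement and tries ?Y1 = 3.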

Step 119 detects whether the analysis failure flag is clear, or whether it is still set after backtracking and retrying, and makes the end determination.

Step 105 hands the data obtained from the analysis over to the outside. The data handed over includes, for example, the coordinates on the document of the rectangular areas detected for each format name.

When the analysis fails in a statement that is designated to set the analysis failure flag, the document cannot be understood, and reject processing is performed. For example, the final or intermediate result of document understanding may be displayed on display 8 and corrected interactively by the operator using keyboard 4.

[Effect of the invention]

As explained above, according to the present invention the documents to be stored can be analyzed automatically, and the entry of bibliographic information from a keyboard becomes unnecessary or is greatly reduced, so that input work is greatly simplified. A further advantage is that a change in the format of the target documents can be handled immediately by changing the format data.

[Brief explanation of the drawing]

FIG. 1 is a reference diagram showing an example of a document; FIGS. 2, 3, and 4 are explanatory diagrams for explaining the present invention; FIG. 5 is a block diagram showing the configuration of an apparatus that implements the document processing method of the present invention; and FIGS. 6 and 7 are flowcharts for explaining the processing in control unit 2 of FIG. 5.

1: bus, 2: control unit, 3: document, 4: keyboard, 5: memory, 6, 7: files, 8: display.

Claims

[Scope of Claims]

1. A document understanding method characterized by comprising: a memory that stores an expression written in accordance with a grammar that describes a document as a set of rectangular areas; means for obtaining a character code sequence representing the document; and a syntax analysis unit that analyzes the document by searching the character code sequence for the rectangular areas specified by the expression written in accordance with the grammar, whereby data representing the meaning of the rectangular areas is generated.

2. The document understanding method according to claim 1, characterized in that the grammar includes, as variables, quantities representing the absolute or relative sizes of the rectangular areas and the absolute or relative relations between the rectangular areas, and also includes a description of a method for searching for the rectangular areas; and in that the syntax analysis unit searches the character code sequence for the rectangular areas specified by the expression written in accordance with the grammar, assigns the values determined from the search results to the variables in the expression, and continues the analysis until no unknown variables remain.

3. The document understanding method according to claim 1, characterized in that the grammar includes a single virtual rectangular area generated by an operation on a plurality of spatially separated rectangular areas.