JPH02136956A

JPH02136956A - Extracting method for layout information

Info

Publication number: JPH02136956A
Application number: JP63289963A
Authority: JP
Inventors: Tetsuo Kiuchi; 木内　哲夫; Takeshi Enshi; 圓子　雄; Ichiro Ogura; 一郎小倉
Original assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Current assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Priority date: 1988-11-18
Filing date: 1988-11-18
Publication date: 1990-05-25

Abstract

PURPOSE: To improve the labor-saving effect by dividing document reading results into blocks and controlling the line-head position of each block, so that the output character string of the reading result can be handled on the user device as a document with aligned line beginnings. CONSTITUTION: Characteristic symbols or symbol strings that are attached to the beginning, middle, or end of lines, for example groups G1 to G6, are stored in a dictionary table. The (candidate) characters or character strings at the beginning or in the middle of lines of the reading result are compared with the dictionary, so that, for example, an itemized part of a document can be detected. The reading results can therefore be divided into blocks of lines, and the line-head, mid-line, and line-end positions can be controlled for each block. The output character string can thus be treated on the user device as a document whose line heads, middles, and ends are aligned. As a result, the user can omit layout correction work and the labor-saving effect is improved.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a method for extracting layout information by means of a character reading device.

[Conventional technology]

Ordinary printed documents use layout devices such as paragraph breaks and itemized lists to make their content easier for readers to understand, and this layout information is valuable to users. When a printed document is input with a character reading device, a blank area at the beginning of a line, for example, is converted into text as a run of consecutive blanks. The number of blanks is obtained by dividing the width W of the blank area by the character width.

[Problem to be solved by the invention]

The above method has the following drawbacks.

(1) The width of the blank area at the beginning of a line is not necessarily specified as a number of characters at phototypesetting time. It is sometimes specified in millimetres, in which case the division generally leaves a remainder.

(2) The numeral "1", for example, is narrower than the kanji "読". In a horizontally written document, the blank area is therefore wider when a line begins with "1" than when it begins with "読".

(3) Because the character width is estimated separately for each line, the estimates differ slightly from line to line.

(4) In some documents, such as rosters, character strings in the middle of lines also need to be aligned.
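The effect of these drawbacks on the conventional conversion can be sketched as follows. This is a minimal illustration, not the device's actual procedure: the function name `blanks_from_width` and all measurements are hypothetical, and rounding to the nearest integer is an assumption, since the text only says the width W is divided by the character width.

```python
def blanks_from_width(blank_width_mm: float, char_width_mm: float) -> int:
    """Conventional conversion: blank count = W / character width (rounded here)."""
    return round(blank_width_mm / char_width_mm)

# Hypothetical measurements for three lines of one paragraph:
# (width W of the leading blank area, per-line estimate of the character width), in mm.
measured = [(10.0, 3.2), (10.6, 3.0), (10.0, 3.0)]

counts = [blanks_from_width(w, h) for w, h in measured]
print(counts)  # the counts differ although the original indents were meant to match
```

With these made-up numbers the three lines receive 3, 4, and 3 blanks, so the text that reaches the user is no longer aligned.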

For these reasons, the number of blanks at the beginning of a line is not constant even within a paragraph. Even when the input document has aligned line beginnings, mid-line positions, and line ends, the document delivered to the user of the character reading device has them misaligned. Consequently, even if the characters themselves are read without error, the user must edit the read document, and the labor-saving effect of the character reading device can hardly be said to be fully realized. Moreover, word processors and document processing programs generally do not cover all the functions of the phototypesetting machines used to produce printed documents.

In particular, only limited functions are currently supported for layout information and font information.

In short, the purpose of a character reading device is to save labor in the initial input of documents and in the reuse of existing documents. Ideally, the text information, layout information, font information, and so on of the manuscript being read would all be converted, exactly as they are, into reusable computer information.

However, when a conventional character reading device processes blanks at the beginning or in the middle of a line, a document whose line-head or mid-line characters are aligned may be output with those characters misaligned. The user must therefore edit the read document even when the characters were read without error, so the labor-saving effect is not fully exploited.

Accordingly, an object of the present invention is to enable a character reading device to extract layout information.

[Means to solve the problem]

A typical case in which line beginnings, mid-line characters, or line ends must be aligned within the same paragraph is an itemized list. Such lists often use symbols or symbol strings such as (1), (2), (3), ... or ①, ②, ③, .... These characteristic symbols or symbol strings are therefore stored as a dictionary. When one of them is detected at the beginning or in the middle of a line of the manuscript, information indicating this, for example a tab symbol, is inserted at that position. When the document is output, the tab symbol is replaced with a fixed number of blanks so that the line-head or mid-line positions are aligned. In rosters and the like, " " (blank) and "...." serve as the characteristic symbols or symbol strings.
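The detect-then-tab idea described above can be sketched roughly as follows. The marker list, the function names `mark_line` and `render`, and the choice of two blanks are all assumptions made for illustration; the actual dictionary contents and blank count are defined elsewhere in the specification.

```python
# Hypothetical dictionary of characteristic line-head markers (illustrative only).
MARKERS = ["(1)", "(2)", "(3)", "①", "②", "③", "■", "・"]

def mark_line(line: str) -> str:
    """Drop the measured leading blanks and put a tab before a line that starts with a marker."""
    body = line.lstrip(" \u3000")  # strip ASCII and full-width blanks
    if any(body.startswith(m) for m in MARKERS):
        return "\t" + body
    return line

def render(line: str, blanks: int = 2) -> str:
    """On output, replace the leading tab with a fixed number of blanks."""
    return " " * blanks + line[1:] if line.startswith("\t") else line

ocr_lines = ["   (1) first item", " (2) second item", "plain text"]
marked = [mark_line(l) for l in ocr_lines]
output = [render(l) for l in marked]
```

Here the two itemized lines arrive with different numbers of leading blanks but leave with the same indent, because the tab carries the layout information instead of the blank count.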

[Effect]

For example, by detecting characteristic symbols or symbol strings at the beginning of lines, such as (1), (2), (3), ... or (a), (b), (c), ..., the document reading result can be divided into blocks of lines and the line-head position can be controlled. The output character string of the reading result can therefore be treated, on the device that uses it, as a document with aligned line beginnings. The user is spared layout corrections, and the labor-saving effect increases.

In the case of a roster or the like, treating " " (blank) as a characteristic symbol or symbol string allows mid-line character positions to be aligned, so the document can be input with the layout of the manuscript unchanged. Items such as name, birthplace, date of birth, and address can then be extracted at the same time, which also makes it easier to build a database.
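A rough sketch of the roster case, under simplifying assumptions: every run of blanks inside a line is treated as the characteristic symbol and replaced by a tab, fields contain no internal blanks, and all characters have the same display width (which does not hold for mixed full-width and half-width text). The function names and sample rows are hypothetical.

```python
import re

def tabify(line: str) -> str:
    """Replace each run of ASCII or full-width blanks inside a line with one tab."""
    return re.sub(r"[ \u3000]+", "\t", line.strip())

def align(rows):
    """Pad every tab-separated field to the widest entry in its column."""
    table = [row.split("\t") for row in rows]
    widths = [max(len(r[c]) for r in table) for c in range(len(table[0]))]
    return ["  ".join(f.ljust(w) for f, w in zip(r, widths)).rstrip() for r in table]

roster = ["Sato    Tokyo   1950-01-02", "Kobayashi  Osaka 1948-11-30"]
tabbed = [tabify(r) for r in roster]
fields = [t.split("\t") for t in tabbed]   # items ready for a database
aligned = align(tabbed)
```

The tab-separated form gives the individual items directly, and the same tabs drive the column alignment on output.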

[Example]

The figure is an explanatory diagram for explaining the present invention. It shows examples of the characteristic symbols or symbol strings (hereinafter also called initial character strings) that appear at the beginning of lines in an itemized document. These are stored as a dictionary in a table in the character reading device or in a computer. In the figure, group G1 consists of parenthesized numbers (1), (2), (3), ...; group G2 of circled numbers ①, ②, ③, ④, ...; group G3 of lowercase letters a, b, c, d, ...; group G4 of uppercase letters A, B, C, D, ...; and groups G5 and G6 of the symbols "■" and "・", respectively.

In an itemized list, both "■" and "(1)" are used as markers, but they are not mixed within one list, whereas "(1)" and "(2)" are used together. That is, "(1), (2), (3), (4) ..." and "①, ②, ③, ④ ..." each form a group, while "・" and "■" each form a group on their own. Each group is therefore given a number so that groups can be distinguished from one another, and where the members of a group are ordered, each member is given a rank. In "(1), (2), (3), (4) ...", (1) has rank 1 and (2) has rank 2. In "(a), (b), (c), (d) ...", (a) has rank 1 and (b) has rank 2. The rank of "■" is 2 in "■, ■, ■, ■ ..." and 1 in "■, ◎, 0.0 ...".

As group G1 shows, a symbol (string) such as "(1)" may be recognized as the single character "(1)", as the two-character combination "(" "1)", or as the three-character combination "(" "1" ")". When several such combinations exist, all of them are stored. One representative combination is chosen and marked as such, for example by placing it first among the combinations. In addition, when lines are cut out of the document, the width of the blank area at the beginning of each line is recorded in a unit that can be converted to a physical quantity (for example mm, point, or dot) and stored in a table called the line-head blank table.
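One possible in-memory layout for the dictionary and the line-head blank table is sketched below. The class `Entry`, the helper `numbered`, the function `lookup`, and the choice of the three-character segmentation as the representative combination are all assumptions for illustration; the specification only requires that every combination be stored and that the representative one be identifiable, for example by being listed first.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass(frozen=True)
class Entry:
    group: int                             # group number (1 for G1, 2 for G2, ...)
    rank: int                              # order within the group
    variants: Tuple[Tuple[str, ...], ...]  # every segmentation; representative first

def numbered(group: int, rank: int, text: str) -> Entry:
    """Build a G1-style entry such as "(1)" with its three possible segmentations."""
    inner = text[1:-1]
    return Entry(group, rank, (
        ("(", inner, ")"),   # representative (an arbitrary choice in this sketch)
        ("(", inner + ")"),
        (text,),
    ))

DICTIONARY = [numbered(1, r, "(%d)" % r) for r in range(1, 10)]
# Circled digits 1..9 (U+2460 onward) as group G2, one segmentation each.
DICTIONARY += [Entry(2, r, ((chr(0x2460 + r - 1),),)) for r in range(1, 10)]

def lookup(head: Sequence[str]) -> Optional[Entry]:
    """Return the entry whose segmentation matches the recognized characters at the line head."""
    for entry in DICTIONARY:
        for variant in entry.variants:
            if tuple(head[:len(variant)]) == variant:
                return entry
    return None

# Line-head blank table: line number -> measured blank width in mm (hypothetical values).
line_head_blank_mm = {0: 10.2, 1: 9.8}
```

For instance, `lookup(["(", "2)", "x"])` finds the rank-2 entry of group 1 even though the recognizer split the marker into two characters.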

After these tables have been created, the (candidate) character string at the beginning of each line of the recognition result is compared with the initial-character-string dictionary. If there is a match, the line number, the group number of the initial character string, and its rank within the group are stored. A range of the document or page in which initial character strings with the same group number appear in rank order is identified and called a block. The lines that belong to no group are collected into what is called the root block. A block consists of a number of consecutive lines. Strings such as "A, B, C ..." can appear at the beginning of a line even when they are not being used as initial character strings, so a line that does not follow the rank order, or that stands alone, is not treated as a block. Such a line belongs to the root block even though an initial character string was detected. The root block may consist of non-contiguous lines. The widths of the blank areas within a block are then read from the line-head blank table and averaged, and the table is rewritten with the average. The same is done for the root block. A tab symbol is also inserted at the beginning of each line of the recognized text.
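The blocking and averaging steps might be sketched as below. Several details are assumptions that the text leaves open: a block is taken to run from a rank-1 marker to the last marker that continues the sequence, unmarked lines in between are kept inside the block, only ordered groups are handled, nested lists are ignored, and the block average is taken over the marked lines only. The names `find_blocks` and `average_blanks` are hypothetical.

```python
def find_blocks(marks):
    """marks[i] is (group, rank) if line i starts with a dictionary marker, else None.
    Returns inclusive (first_line, last_line) pairs."""
    blocks, i, n = [], 0, len(marks)
    while i < n:
        if marks[i] is not None and marks[i][1] == 1:
            group, expected, last, j = marks[i][0], 2, i, i + 1
            while j < n:
                if marks[j] is None:               # possible continuation line
                    j += 1
                elif marks[j] == (group, expected):
                    last, expected, j = j, expected + 1, j + 1
                else:
                    break                          # different group or out of order
            if last > i:                           # a lone marker is not a block
                blocks.append((i, last))
                i = last + 1
                continue
        i += 1
    return blocks

def average_blanks(widths, marks, blocks):
    """Rewrite the line-head blank table with one mean per block and one for the root block."""
    result, in_block = list(widths), set()
    for first, last in blocks:
        marked = [k for k in range(first, last + 1) if marks[k] is not None]
        mean = sum(widths[k] for k in marked) / len(marked)
        for k in range(first, last + 1):
            result[k] = mean
            in_block.add(k)
    root = [k for k in range(len(widths)) if k not in in_block]
    if root:
        mean = sum(widths[k] for k in root) / len(root)
        for k in root:
            result[k] = mean
    return result

marks = [None, (1, 1), None, (1, 2), (1, 3), None, (4, 1), None]
widths = [0.0, 10.2, 14.9, 9.8, 10.0, 0.3, 0.1, 0.0]   # mm, hypothetical
blocks = find_blocks(marks)
table = average_blanks(widths, marks, blocks)
```

In this example lines 1 to 4 form one block, while the lone "A"-style marker on line 6 stays in the root block together with lines 0, 5, and 7.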

Thereafter, when the document is output, the tab symbol at the beginning of each line is replaced so that line beginnings are aligned on the device that uses the text. For a document processing program that can specify the line-head position in millimetres, the tab is converted according to that program's format. For a word processor that can specify the line-head position only as a number of blanks, the line-head blank table is consulted and the tab is converted to a fixed number of blanks, using a single character width for the whole document or page. Lines in a block that have no initial character string are indented further than lines that have one. The number of blanks inserted for this purpose equals the number of characters in the representative combination in the initial-character-string dictionary.
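For the word-processor case, the tab replacement could look like the following sketch. The function name `render_block`, the rounding rule, and the sample values (a 10 mm block indent, a 5 mm character width, a three-character representative combination) are assumptions for illustration.

```python
def render_block(lines, block_width_mm, char_width_mm, representative_len):
    """lines: (has_marker, text) pairs whose text starts with the inserted tab.
    One character width is used for the whole document or page."""
    base = round(block_width_mm / char_width_mm)   # blanks for marked lines
    out = []
    for has_marker, text in lines:
        # Lines without a marker are indented further, by the marker's length.
        indent = base if has_marker else base + representative_len
        out.append(" " * indent + text.lstrip("\t"))
    return out

block = [(True, "\t(1)apples"), (False, "\tand pears"), (True, "\t(2)plums")]
rendered = render_block(block, block_width_mm=10.0, char_width_mm=5.0,
                        representative_len=3)
```

Because every line of the block shares the averaged width and the same character width, the markers line up and the continuation line starts under the item text.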

The above describes the line-head case. In a roster, for example, " " (blank) is taken as the symbol of interest, and entries are aligned by attaching tabs. Needless to say, the same processing can be applied to mid-line and line-end positions as required.

[Effect of the invention]

Characteristic symbols or symbol strings that appear at the beginning, middle, or end of lines of a document, such as (1), (2), (3), ..., (a), (b), (c), ..., or " " (blank), are stored in a dictionary table, and the (candidate) characters or character strings at the beginning or in the middle of lines of the reading result are compared with the dictionary. This makes it possible to detect, for example, the itemized parts of a document. The reading result can thus be divided into blocks of lines, and the line-head, mid-line, and line-end positions can be controlled block by block, so the output character string can be handled by the device that uses it as a document whose line beginnings, mid-line positions, and line ends are aligned.

As a result, the user is spared layout correction work, and the labor-saving effect is increased.

[Brief explanation of the drawing]

The figure is an explanatory diagram for explaining the method according to the present invention.

Explanation of reference signs: G1 to G6: symbol or symbol-string groups.

Agents: Patent Attorney Akio Namiki; Patent Attorney Kiyoshi Matsuzaki

Procedural Amendment (Formality)

Addressee: Mr. 士田文毅

1. Indication of the case: Patent Application No. 289963 of 1988 (Showa 63)
2. Title of the invention: Extracting method for layout information
3. Person making the amendment: Relationship to the case: patent applicant. Address: 1-1 Tanabe Shinden, Kawasaki-ku, Kawasaki. Name: (523) Fuji Electric Co., Ltd.
4. Agent: 8105, telephone 03 (580) 9513 (and one other)
5. Date of amendment order: December 20, 1988 (Showa 63)
6. Subject of amendment: Specification and drawings
7. Contents of amendment:
(1) On page 6, line 2 of the specification (Detailed Description of the Invention), "The figure" is corrected to "Fig. 1", and likewise on page 10, line 15 of the specification (Brief Explanation of the Drawing), "The figure" is corrected to "Fig. 1".
(2) The drawing is amended as shown in the attached sheet (that is, the existing drawing is amended to a drawing bearing the figure number "Fig. 1").

Fig. 1

Claims

[Claims]

A method for extracting layout information, characterized in that predetermined characteristic symbols or symbol strings are stored in advance, and when a stored symbol or symbol string is detected in a document to be read, a predetermined symbol indicating the detection is generated at the corresponding position so that it can be used as information for the layout of the document.