JPH06203020A

JPH06203020A - Method and device for recognizing and generating text format

Info

Publication number: JPH06203020A
Application number: JP4361390A
Authority: JP
Inventors: Minoru Ashizawa; 実芦沢
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1992-12-29
Filing date: 1992-12-29
Publication date: 1994-07-22

Abstract

PURPOSE: To generate a converted document whose text format is equivalent to that of the input, by removing the text format information from a text-formatted document and applying conversion processing such as machine translation, without troublesome manual work. CONSTITUTION: The degree of coincidence is calculated for pairs of lines in an input file, and the page length is inferred from the result. A run of lines whose degree of coincidence with the lines one page length away is at or above a prescribed value, and which is contiguous with a page boundary, is recognized as a page header or page footer. A run of consecutive character columns whose blank-character rate is at or above a prescribed value is taken to be a boundary between layout column areas; the layout columns and the figures and tables of each page are recognized, and the text is extracted. Figure/table areas and lines that need no processing are recognized from the proportion of specific characters. After the conversion processing, a converted document with a text format equivalent to that of the input is generated. In this way, the man-hours needed to remove text format information and to re-format the text after translation can be reduced.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a text format recognition and generation method and apparatus that separate format information from text information in document data already formatted as text, and then recombine them. The invention is suitable, for example, for automatic translation of character code data strings of documents distributed by electronic mail or on machine-readable media, and of character code data strings generated by an optical character recognition device from documents printed on paper or the like. It is also suitable as preprocessing for database construction, keyword extraction, and other text processing.

[0002]

[Prior Art] In a document created with a word processor or the like, format (document layout) information such as double spacing, left margin settings, and enlarged characters is represented as blank lines or blank characters when the document is converted to a text file (a file containing only character code information, with no format information). The need to analyze the converted text file and remove the blanks and blank lines that stem from format information is pointed out in "Machine Translation Electronic Mail System" (Fumito Nishino and Naoto Nakamura, IPSJ Special Interest Group on Natural Language Processing, 75-5 (1990.1)). However, no way of realizing this has been reported, and in the end the removal is done manually.

[0003] When the text contains figures, tables, mathematical formulas, or the like, these often need no translation, and applying machine translation to them often produces unintelligible results instead. For such portions, a symbol indicating that translation is unnecessary is inserted into the text by a manual operation called pre-editing.

[0004] When the text of a document is recognized with an optical character recognition device, a person must use a dedicated editor, either immediately after the image of each page is read or after the characters have been recognized, to manually specify areas such as layout columns, figures, and tables, and the order in which the areas are connected.

[0005] The above-mentioned document "Machine Translation Electronic Mail System" reports an implementation that, after machine translation, automatically embeds in the translation result formatting commands that designate title lines and paragraphs according to the document structure of the original. However, the page length, page headers, page footers, layout columns, and the like could only be specified manually in order to reconstruct the document format.

[0006]

[Problems to Be Solved by the Invention] In the prior art described above, machine translation of text data that has already been text-formatted with page headers, page footers, layout columns, and figure and table placement requires that the page header, page footer, layout column, and figure/table areas be cut out, and the order of these areas specified, by hand. Processing a large amount of data therefore requires many man-hours.

[0007] Pre-editing to insert symbols that mark portions needing no translation must also be done by hand, so processing a large amount of data again requires many man-hours.

[0008] Furthermore, a text formatting program that embeds in the translation result formatting commands like those of the original cannot always give the translation result a format equivalent to that of the text-formatted original document. When it cannot, the document output by the text formatting program must be corrected by hand. Processing a large amount of data therefore requires many man-hours.

[0009] In view of the problems of the prior art described above, an object of the present invention is to provide a text format recognition and generation method and apparatus that can automate each of the troublesome operations described above, which have so far been performed by hand.

[0010]

[Means for Solving the Problems] To achieve the above object, the text format recognition and generation method according to the present invention comprises the steps of: inputting a text file that has been text-formatted with page headers, page footers, layout columns, figure and table placement, and the like; estimating the page length of the input text file; recognizing page footers and/or page headers; recognizing areas such as layout columns, figures, and tables; recognizing the type of each recognized area, that is, whether it is a text area or a figure/table area; determining the connection order of the recognized areas; extracting, in accordance with the connection order, text and figures/tables that span a plurality of areas; applying a predetermined conversion to the extracted text and figures/tables; and giving the converted text and figures/tables a format equivalent to that of the input text file.

[0011] The text format recognition and generation apparatus according to the present invention comprises: means for inputting a text file that has been text-formatted with page headers, page footers, layout columns, figure and table placement, and the like; means for estimating the page length of the input text file; means for recognizing page footers and/or page headers; means for recognizing areas such as layout columns, figures, and tables; means for recognizing the type of each recognized area, that is, whether it is a text area or a figure/table area; means for determining the connection order of the recognized areas; means for extracting, in accordance with the connection order, text and figures/tables that span a plurality of areas; means for applying a predetermined conversion to the extracted text and figures/tables; and means for giving the converted text and figures/tables a format equivalent to that of the input text file.

[0012]

[Operation] The text format recognition and generation method according to the present invention is applied, for example, to a machine translation system consisting of an input device, a display device, a file storage device, a translation device, and a system device. First, a text-formatted document to be processed (here, to be translated) is input through the input device and stored in the file storage device as an input file.

[0013] The system device performs processing in accordance with the page length estimation method, page header recognition method, page footer recognition method, area recognition method, area type recognition method, area connection order determination method, text extraction method, translation method, and text format generation method, and controls the input device, display device, file storage device, and translation device.

[0014] The page length estimation method reads some or all of the lines from the beginning of the input file, numbers the lines read starting from the top, and calculates the degree of coincidence between a given line and the lines that lie closer to the end of the file. The difference between the line numbers of the two lines compared is called the line offset. For each pair of lines whose degree of coincidence is at or above a certain value, the pair (degree of coincidence, line offset) is accumulated; this is done for every line, and the accumulated result is called the line coincidence calculation result. The method then counts how often each line offset occurs in that result. For each line offset, it assumes that the offset is the page length, computes the number of pages from the number of lines read, and takes the ratio of the offset's frequency to that number of pages. The line offset with the largest ratio is estimated to be the page length.
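The selection rule at the end of this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `estimate_page_length`, the representation of the accumulated result as a list of `(coincidence, offset)` pairs, and the tie-breaking toward the smaller offset are all choices made for the example.

```python
from collections import Counter


def estimate_page_length(results, lines_read):
    """Pick the line offset whose frequency per assumed page is largest.

    results    -- accumulated (degree of coincidence, line offset) pairs
    lines_read -- number of lines read from the input file
    Returns None when no pair was accumulated.
    """
    frequency = Counter(offset for _, offset in results)
    best_offset, best_ratio = None, 0.0
    # Visit offsets in ascending order so that ties keep the smaller one
    # (an assumption; the text does not say how ties are resolved).
    for offset in sorted(frequency):
        pages = lines_read / offset  # page count if this offset were the page length
        ratio = frequency[offset] / pages
        if ratio > best_ratio:
            best_offset, best_ratio = offset, ratio
    return best_offset
```

For example, with two pairs at offset 5 and one pair at offset 3 over 10 lines read, the ratios are 2 / (10 / 5) = 1.0 and 1 / (10 / 3) = 0.3, so the sketch returns 5. It follows the stated ratio literally and makes no attempt to treat multiples of the true page length specially.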

[0015] The page header recognition method takes as the page header those lines that have, among their line offsets in the line coincidence calculation result, a line offset equal to the page length, and that run consecutively from the first line of a page toward the end of the file. In addition, when the lines recognized as a page header contain a digit in a character column that differs from page to page, that character column is recognized as a page number field.

[0016] The page footer recognition method takes as the page footer those lines that have, among their line offsets in the line coincidence calculation result, a line offset equal to the page length, and that run consecutively from the last line of a page toward the beginning of the file. In addition, when the lines recognized as a page footer contain a digit in a character column that differs from page to page, that character column is recognized as a page number field.
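The header and footer rules above might look like the following Python sketch. It assumes that an earlier pass has produced `matched`, a set of 0-based line numbers that have a partner exactly one page length away; that set, the function names, and the 0-based numbering are hypothetical conveniences, not part of the original description.

```python
def header_footer_sizes(matched, page_length, page_index=0):
    """Count header lines (down from the page top) and footer lines (up from the bottom)."""
    first = page_index * page_length
    last = first + page_length - 1
    header = 0
    while header < page_length and (first + header) in matched:
        header += 1
    footer = 0
    # Stop before running into lines already counted as header.
    while footer < page_length - header and (last - footer) in matched:
        footer += 1
    return header, footer


def page_number_columns(same_line_per_page):
    """Character columns that differ between pages and hold a digit on some page."""
    width = max(len(line) for line in same_line_per_page)
    padded = [line.ljust(width) for line in same_line_per_page]
    columns = []
    for col in range(width):
        chars = [line[col] for line in padded]
        if len(set(chars)) > 1 and any(ch.isdigit() for ch in chars):
            columns.append(col)
    return columns
```

`page_number_columns` takes the same header (or footer) line as it appears on each page, for example `["REPORT A   page 1", "REPORT A   page 2"]`, and reports the positions where a page number appears to vary.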

[0017] The area recognition method operates as follows.

[0018] Part or all of the input file is read from the beginning. When a line is too short to reach a given character column, that character column of the line is treated as containing a blank character, and the newline character marking the end of a line is also treated as a blank character. Blank characters are then counted for each character column, and a range of consecutive character columns in which the ratio of the blank count to the number of lines read is at or above a certain value is assumed to be a boundary between layout columns.
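A minimal Python sketch of this blank-rate test is given below. It assumes the lines have already had their trailing newline removed, that only the ASCII space counts as a blank character, and that the threshold is 0.9; the function name `column_boundaries` and these values are illustrative assumptions.

```python
def column_boundaries(lines, threshold=0.9):
    """Return (first, last) character-column ranges that are mostly blank."""
    if not lines:
        return []
    # One extra position so the newline after the longest line is counted too.
    width = max(len(line) for line in lines) + 1
    blanks = [0] * width
    for line in lines:
        for col in range(width):
            # Positions past the end of the line (including the newline) count as blank.
            if col >= len(line) or line[col] == " ":
                blanks[col] += 1
    ranges, start = [], None
    for col in range(width):
        if blanks[col] / len(lines) >= threshold:
            if start is None:
                start = col
        elif start is not None:
            ranges.append((start, col - 1))
            start = None
    if start is not None:
        ranges.append((start, width - 1))
    return ranges
```

The last range returned always covers the newline position of the longest line; a caller interested only in gaps between layout columns would presumably discard it together with any leading margin.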

[0019] Next, the area recognition method reads the data making up one page according to the estimated page length, and performs the following processing for each page.

[0020] Assume that the range from one character column to another forms a single layout column. When all the characters of a line within that range are blank, or the line is too short to reach that range, the area of that layout column is split immediately before and immediately after that line.

[0021] Again assume that the range from one character column to another forms a single layout column. When a line has a non-blank character in a character column assumed to be a layout column boundary, any areas of the layout columns adjacent across that boundary end immediately before that line, and area recognition continues on the assumption that, from that line on, there is an area of a layout column whose width is that of the adjacent layout columns combined.

[0022] The area recognition method operates as described above.

[0023] For each area, the area type recognition method computes the ratio of the number of characters frequently used to draw figures and tables, such as '+', '-', '|', and '‖', to the total number of characters in the area excluding blank and newline characters. An area for which this ratio is at or above a certain value is recognized as a figure/table area, and any other area as a text area.
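As an illustration, the classification could be sketched in Python like this. The character set is limited to the four examples named in the text, and the threshold of 0.5, the labels `"figure"` and `"text"`, and the treatment of an all-blank area as text are assumptions of the sketch.

```python
DRAWING_CHARS = set("+-|‖")  # only the examples named in the text


def area_type(area_lines, threshold=0.5):
    """Classify an area as 'figure' or 'text' by its share of drawing characters."""
    chars = [ch for line in area_lines for ch in line if not ch.isspace()]
    if not chars:
        return "text"  # assumption: an all-blank area is treated as text
    drawing = sum(1 for ch in chars if ch in DRAWING_CHARS)
    return "figure" if drawing / len(chars) >= threshold else "text"
```

For a small box such as `["+---+", "| a |", "+---+"]`, 12 of the 13 non-blank characters are drawing characters, so the area is classified as a figure/table area.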

[0024] The area connection order determination method operates as follows.

[0025] When an area within a page of the input file has areas adjacent to it above or below, those areas are connected from top to bottom. When it has areas adjacent to it on the left or right, the lowest connected area on the left is connected to the topmost area on the right. In determining these connections, an area whose connection has already been fixed is not connected again but is skipped. As a result, a single chain of connection order is determined over the areas within the page.

[0026] Next, the area connection order determination method connects the area that is last in the connection order within a page of the input file to the leftmost, uppermost area of the next page.

[0027] Next, at every point where a figure/table area is connected immediately after a text area, the text area is instead connected to the next area, skipping the figure/table area. The skipped figure/table areas are connected to one another in their original order, so that two chains of connection order result.
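The effect of this step can be sketched in Python as a split of the single chain into two. The sketch assumes the chain is a list of `(area_id, kind)` pairs with `kind` being `"text"` or `"figure"`, and it moves every figure/table area to the second chain, including one that happens not to follow a text area; the text does not spell out that case, so this is an assumption.

```python
def split_chains(ordered_areas):
    """Split one connection chain into a text chain and a figure/table chain."""
    text_chain = [area_id for area_id, kind in ordered_areas if kind == "text"]
    figure_chain = [area_id for area_id, kind in ordered_areas if kind == "figure"]
    return text_chain, figure_chain
```

Both resulting chains keep the relative order the areas had in the original chain.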

[0028] Next, suppose that one text area is connected immediately after another. If the word frequency distributions of the two text areas differ greatly and the words at the boundary between them do not join properly, and if a text area located later, in an adjacent layout column, has a word frequency distribution similar to that of the first text area and joins properly with it at the boundary, the connection order is changed so that this later text area is connected immediately after the first text area.

[0029] The area connection order determination method operates as described above.

[0030] When extracting text from a series of areas, the text extraction method extracts text only from the areas recognized as text areas, concatenates that text according to the connection order of the areas, and generates a text file. From the figure/table areas, text is extracted area by area, and a separate text file is generated for each.

[0031] For each line, the translation-unnecessary line recognition method counts the characters frequently used to write mathematical formulas, such as '+', '-', '×', '÷', '^', '±', equals signs, inequality signs, and 'Σ', together with the characters frequently used to draw figures and tables, such as '|' and '‖', and computes the ratio of this count to the total number of characters in the line excluding blank and newline characters. A line for which this ratio is at or above a certain value is judged to need no translation, and a no-translation designation is attached to it. The threshold for this judgment is changed according to the type of the area to which the line belongs.
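A rough Python sketch of this per-line test, with a threshold that depends on the area type, is shown below. The exact symbol set, the two threshold values, and the decision to leave empty lines unmarked are hypothetical; the text only says that the threshold varies with the area type.

```python
SYMBOL_CHARS = set("+-×÷^±=<>≤≥≠Σ|‖")  # illustrative set based on the examples in the text
THRESHOLDS = {"text": 0.5, "figure": 0.2}  # hypothetical values


def needs_no_translation(line, area_kind):
    """Return True if the line is dominated by formula or drawing characters."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return False  # assumption: empty lines are not marked
    ratio = sum(ch in SYMBOL_CHARS for ch in chars) / len(chars)
    return ratio >= THRESHOLDS[area_kind]
```

With these example thresholds, the line `x = a + b` has a symbol ratio of 2/5 and is marked only when it lies in a figure/table area, while a rule line such as `+-----+` is marked in either kind of area.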

[0032] The translation device and translation method translate, by a known method, the text file generated from the extracted text. The text format recognition and generation method of the present invention is applicable not only to machine translation but also to any case in which various conversions are applied to the extracted text and figures/tables and the result is output in the same format as the original document.

[0033] The text format generation method operates as follows.

[0034] First, when text areas that are adjacent in the connection order belong to the same layout column, those areas are merged.

[0035] When the result of translating the text of a figure/table area exceeds the number of lines or character columns occupied by the original area, the number of lines and character columns of the area in which the translated result is embedded is increased so that the result fits. As the area grows, other areas adjacent to it are moved or shrunk so that everything still fits on the page.

[0036] When the result of translating the text does not fit in the areas of the original formatted text, the text format generation method newly generates pages formatted in accordance with the recognized page header, page footer, and layout columns, and places in the areas of the generated pages the portion of the translation result that overflows the areas of the original formatted text.

[0037] When a margin remains in an area after the translated text has been placed in the area corresponding to the text's original area, the margin is filled with blank characters or newline characters. When every area of a page contains only blank or newline characters, that page is deleted.
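The padding and page-deletion rules can be illustrated with the short Python sketch below. It assumes that an area is a list of lines, a page is a list of areas, and the translated text already fits within the given width and height (overflow is handled by the preceding steps); the function names are hypothetical.

```python
def pad_area(text_lines, width, height):
    """Fill the unused part of an area with blank characters and blank lines."""
    padded = [line.ljust(width) for line in text_lines]
    padded += [" " * width] * (height - len(padded))
    return padded


def drop_blank_pages(pages):
    """Remove pages whose areas contain nothing but blank characters."""
    return [
        page
        for page in pages
        if any(line.strip() for area in page for line in area)
    ]
```

For instance, `pad_area(["ab"], 4, 2)` yields two lines of width 4, the second consisting only of blank characters.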

[0038] After that, the system device stores the processing result in the file storage device or displays it on the display device.

[0039]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings.

[0040] FIG. 15 shows the configuration of an apparatus to which the text format recognition and generation method according to an embodiment of the present invention is applied. The apparatus comprises an input device 15001, a file storage device 15002, a system device 15003, a translation device 15004, and a display device 15005.

[0041] To perform translation, the original text is input through the input device 15001 and stored as an input file in the file storage device 15002 via the system device 15003. The original text is a formatted document that has been converted to a text file. The system device 15003 reads this input file and recognizes its text format. Translation is then performed by the translation device 15004. Finally, the system device 15003 generates a translation result document whose format is equivalent to the previously recognized text format of the original.

[0042] The operating procedure of this embodiment is described in detail below. First, before the text format of the original is recognized and translation is performed, the input file (the original text) is stored in the file storage device 15002 via the input device 15001 and the system device 15003.

[0043] FIG. 1 is a flowchart of the text format recognition method and the translation method. These methods run on the system device 15003 of FIG. 15.

[0044] FIG. 16 shows an example of the text data of an input file stored in the file storage device 15002. To make the explanation easier to follow, the text format recognition processing and the translation processing are assumed to be applied to the data shown in this figure.

[0045] The text of FIG. 16 consists of 46 lines of data, shown as lines 16001 to 16046. The newline character is represented by '▽', as indicated by reference numeral 16047. This '▽' marks the right end of each line. The left end of each line is at the position of the '1' at the left end of line 16002.

[0046] Lines 16001, 16003, 16015, 16021, 16023, 16024, 16026, 16044, and 16046 have a newline character at the left end of the line and contain no ordinary character data. Lines 16004 to 16014, 16016 to 16020, 16022, 16027 to 16043, and 16045 are filled with blank characters from the left end of the line up to just before the first character visible in the figure.

[0047] In FIG. 16, the characters representing the text are shown as arbitrary digits and the symbol ':'. This is only to simplify the explanation; in practice various characters are used.

[0048] Next, the procedure of the text format recognition method and the translation method is described concretely with reference to FIG. 1.

[0049] First, in page length estimation step 1001, part or all of the input file is read and the page length of the input file's text format is estimated. This step is described in detail later with reference to FIGS. 2, 3, and 4. Next, page headers are recognized in step 1002 and page footers in step 1003. Page header recognition step 1002 is described in detail later with reference to FIG. 5, and page footer recognition step 1003 with reference to FIG. 6.

[0050] Next, areas such as layout columns are recognized in step 1004, the types of the areas are recognized in step 1005, and the connection order of the areas is determined in step 1006. Area recognition step 1004 is described in detail later with reference to FIGS. 7 and 8, area type recognition step 1005 with reference to FIG. 9, and area connection order determination step 1006 with reference to FIGS. 10 and 11.

[0051] Next, text and figures/tables are extracted in step 1007. Text and figure/table extraction step 1007 is described in detail later with reference to FIG. 12. Portions that need no translation are then recognized in step 1008, and machine translation is performed in step 1009. In step 1010, the text format is generated and applied to the machine translation result, and processing ends. Translation-unnecessary portion recognition step 1008 is described in detail later with reference to FIG. 13, and the text format generation processing with reference to FIG. 14.

[0052] Next, the page length estimation step, step 1001 of FIG. 1, is described with reference to FIGS. 2, 3, and 4.

[0053] First, in step 2001 of FIG. 2, the already-input line group buffer is emptied. This buffer accumulates the lines of the input file as they are read from the file storage device 15002 into the system device 15003. For simplicity, the word "buffer" is omitted in the figures, which simply read "already-input line group". The same applies to the other buffers, counters, and the like.

[0054] Next, in step 2002, the current input line number counter is set to 0. This counter is used to number the lines of data accumulated in the already-input line group. Then, in step 2003, the line coincidence calculation result buffer is emptied. This buffer accumulates the results of calculating the degree of coincidence for various pairs of lines in order to estimate the page length.

[0055] Subsequently, in steps 3001 to 3006 of FIG. 3, the degree of coincidence is calculated for pairs of lines while the lines of the input file are being read.

[0056] First, decision step 3001 checks whether the first 300 lines of the input file being processed have already been processed or the end of the file has been reached. The figure of 300 lines is used because reading about 300 lines is enough to estimate the page length in most cases. At this point no line has been read yet, so the condition is not satisfied and processing proceeds to step 3002.

[0057] In step 3002, one line is read from the input file and its data is stored in the current input line buffer. At this point, the data of line 16001 in FIG. 16 becomes the current input line. In the following step 3003, 1 is added to the value of the current input line number counter. Since the counter is 0 at this point, its value becomes 1 in step 3003.

[0058] In the next step 3004, the degree of coincidence between each line stored in the already-input line group buffer and the current input line stored in the current input line buffer is calculated. Since the already-input line group buffer is empty at this point, no degree of coincidence is calculated in this step, and consequently nothing is added to the line coincidence calculation result buffer.

[0059] In the next step 3005, the pair (current input line number, current input line), taken from the current input line number counter and the current input line buffer, is added to the already-input line group buffer. The process then returns to step 3001.

[0060] At this point, the process passes through step 3001 and proceeds to step 3002.

[0061] In step 3002, the next line is read from the input file into the current input line buffer and becomes the current input line. At this point, the data of line 16002 in FIG. 16 becomes the current input line. In the following step 3003, 1 is added to the value of the current input line number counter. The current input line number is 1 at this point, so step 3003 sets it to 2.

[0062] In the next step 3004, the matching score between the lines stored in the already-input line group buffer and the current input line is calculated. At this point the already-input line group buffer holds line 16001, so this step calculates the matching score between the current input line, that is, line 16002, and line 16001, and sets the value in the line matching score buffer.

[0063] The matching score of two lines is defined as the number of columns in which both lines hold the same character, divided by the average of the lengths of the two lines. Lines 16001 and 16002 have no column holding the same character, so the matching score is 0, and this value is set in the line matching score buffer. Because the matching score is 0, no data is added to the line matching score calculation result buffer. In this embodiment, data is added to the line matching score calculation result buffer in this step only when the matching score is 0.75 or more.
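
The definition above (the matching score, or degree of coincidence, of two lines) can be sketched in a few lines of Python. This is an illustrative sketch, not the patent's implementation; the function name `matching_score` and the handling of two empty strings are assumptions.

```python
def matching_score(line_a: str, line_b: str) -> float:
    """Number of columns holding the same character in both lines,
    divided by the average of the two line lengths.

    Lines are expected to include their trailing newline, which the
    description counts as one character.
    """
    if not line_a and not line_b:
        return 0.0  # assumption: two empty strings score 0 (avoids 0/0)
    # zip stops at the shorter line, so only shared columns are compared.
    same_columns = sum(1 for ca, cb in zip(line_a, line_b) if ca == cb)
    average_length = (len(line_a) + len(line_b)) / 2
    return same_columns / average_length
```

Two blank lines consisting only of a newline score 1, while a blank line compared with a longer line scores 0 because the newline characters sit in different columns.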

[0064] In the next step 3005, the pair (current input line number, current input line) is added to the already-input line group buffer. As a result, the already-input line group buffer holds lines 16001 and 16002 together with their line numbers. The process then returns to step 3001.

[0065] Here the process again passes through step 3001 and proceeds to step 3002.

[0066] In step 3002, the next line is read from the input file into the current input line buffer and becomes the current input line. At this point, the data of line 16003 in FIG. 16 becomes the current input line. In the following step 3003, 1 is added to the value of the current input line number counter. The current input line number is 2 at this point, so step 3003 sets it to 3.

[0067] In the next step 3004, the matching score between the lines stored in the already-input line group buffer and the current input line is calculated. At this point the already-input line group buffer holds lines 16001 and 16002, so this step calculates the matching score between each of these lines and the current input line, that is, line 16003, and sets each value in the line matching score buffer.

[0068] In the comparison between line 16001 and line 16003, the leftmost column of both lines holds the same newline character, so the matching score is 1, and this value is set in the matching score buffer. A newline character counts as one character. Because the matching score of lines 16001 and 16003 is at or above the threshold of 0.75, an entry is recorded for them. The line offset is 2, the difference between line number 1 of line 16001 and line number 3 of line 16003. The triple (already-input line number, line offset, matching score), that is, (1, 2, 1), is therefore added to the line matching score calculation result buffer.

[0069] FIG. 17 shows an example of the data stored in the line matching score calculation result buffer. Entry 17007 is the triple (1, 2, 1) that has just been added. To make the calculation process and its results easier to follow, FIG. 17 shows, in addition to the already-input line number 17002, the line offset 17003, and the matching score 17006, the serial number 17001, the number of matching columns 17004, and the line length 17005 (the average of the two line lengths).

[0070] The same step 3004 calculates the matching score between the current input line, line 16003, and every line in the already-input line group buffer, so the matching score of lines 16002 and 16003 is calculated next. These two lines have no column holding the same character, so the matching score is 0. Nothing is therefore added to the line matching score calculation result of FIG. 17 for these two lines.

[0071] In the next step 3005, the pair (current input line number, current input line) is added to the already-input line group buffer. As a result, the already-input line group buffer holds lines 16001, 16002, and 16003 together with their line numbers. The process then returns to step 3001.

[0072] Thereafter, lines are read in the same way, the matching score between each line of the already-input line group buffer and the current input line is calculated, and data is added to the line matching score calculation result of FIG. 17 for every pair whose matching score is 0.75 or more. FIG. 17 shows the line matching scores calculated in this way from the input file of FIG. 16.
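
The loop of steps 3001 to 3005 can be summarized as follows. This is a minimal sketch under stated assumptions: the names `collect_matches` and `matching_score` are hypothetical, the 300-line limit and the 0.75 threshold are the values given in the description, and the input is taken as a list of strings that include their newline characters.

```python
def matching_score(line_a: str, line_b: str) -> float:
    """Matching columns divided by the average line length."""
    if not line_a and not line_b:
        return 0.0  # assumption: avoid division by zero
    same = sum(1 for ca, cb in zip(line_a, line_b) if ca == cb)
    return same / ((len(line_a) + len(line_b)) / 2)


def collect_matches(lines, threshold=0.75, max_lines=300):
    """Return triples (already-input line number, line offset, score)."""
    results = []          # line matching score calculation result buffer
    already_input = []    # already-input line group buffer: (number, line)
    for number, line in enumerate(lines[:max_lines], start=1):
        # Step 3004: compare the current line with every earlier line.
        for prev_number, prev_line in already_input:
            score = matching_score(prev_line, line)
            if score >= threshold:
                results.append((prev_number, number - prev_number, score))
        # Step 3005: remember the current line and its number.
        already_input.append((number, line))
    return results
```

With the first three lines of the walkthrough (a blank line, a longer line, and another blank line), the only recorded triple is `(1, 2, 1.0)`. The pairwise comparison is quadratic in the number of lines, which the 300-line cap keeps small.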

[0073] When line 16046 in FIG. 16 has been read and decision step 3001 is reached again after steps 3004 and 3005, the end of the file has been reached, so the process proceeds to step 3006. In step 3006, the number of lines actually read is set in the read line count counter. At this point, its value is 46.

[0074] Next, the process proceeds to step 4001 in FIG. 4. In step 4001, the page length estimation result buffer is emptied. The page length estimation result buffer stores the result of statistically processing the line matching score calculation result.

[0075] In step 4002, the following processing is applied to the line matching score calculation result (FIG. 17). First, the number of entries having the same line offset is taken as the line offset frequency. The number of lines read divided by the line offset frequency is taken as the expected number of pages, and the line offset frequency divided by the expected number of pages is taken as the reliability. For each line offset value, an entry (line offset, line offset frequency, expected number of pages, reliability) is created and added to the page length estimation result buffer. FIG. 18 shows the result of this step at this point, sorted in descending order of reliability.

[0076] In the next step 4003, the page length estimate is decided. That is, in the page length estimation result of FIG. 18, the line offset of the entry whose reliability is at least 1 and is the largest is set in the page length buffer. If the maximum reliability is less than 1, the number of lines read is set in the page length buffer instead. At this point, as shown at 18001, the maximum reliability is 3, so the line offset of that entry, 23, is taken as the page length.
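
Steps 4002 and 4003 can be sketched as below. This is a hedged illustration, not the patent's code, and it makes one loud assumption: the expected number of pages is computed as the number of lines read divided by the line offset itself, not by the line offset frequency as the text literally states. That reading is the one that reproduces the worked example (46 lines, offset 23, six matching entries, reliability 3). The function name and the tuple layout are hypothetical.

```python
from collections import Counter


def estimate_page_length(results, lines_read):
    """results: triples (already-input line number, line offset, score).

    Returns (page_length, table), where table rows are
    (line offset, frequency, expected pages, reliability),
    sorted by reliability in descending order.
    """
    frequency = Counter(offset for _number, offset, _score in results)
    table = []
    for offset, count in frequency.items():
        # ASSUMPTION: divide by the offset, see the note above.
        expected_pages = lines_read / offset
        reliability = count / expected_pages
        table.append((offset, count, expected_pages, reliability))
    table.sort(key=lambda row: row[3], reverse=True)
    if table and table[0][3] >= 1:
        return table[0][0], table
    # No candidate is reliable enough: treat the whole input as one page.
    return lines_read, table
```

For example, six entries with offset 23 out of 46 lines give an expected page count of 2 and a reliability of 3, so 23 is chosen as the page length.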

[0077] In the next step 4004, the longest line among the lines of the already-input line group buffer is found, and its length is set in the maximum line length buffer. The line length includes the newline character at the end of the line. At this point, as can be seen from FIG. 16, the maximum line length is 53.

[0078] This completes the operation of the page length estimation step 1001 of FIG. 1.

[0079] Next, the process proceeds to the page header recognition step 1002. The details of this step are described with reference to FIG. 5. First, in step 5001, the lines of the already-input line group buffer are examined downward from the top. A line qualifies if the line matching score calculation result (FIG. 17) contains an entry whose already-input line number equals the number of that line and whose line offset equals the page length. The range over which such lines continue without interruption from the top of the already-input line group buffer is detected. The detected lines are recognized as the page header and copied to the page header storage area. This copied data is hereinafter simply called the page header.

[0080] Further, each line recognized as part of the page header is compared with the line that lies below it by a number of lines equal to the page length. If a column in which the characters differ holds a digit, that column is recognized as a page number, and the character in that column of the page header is replaced with '$'.

[0081] At this point, for the lines that continue downward from the top of the already-input line group buffer, the entries of the line matching score calculation result whose line offset equals the page length of 23 are 17008, 17009, and 17010 in FIG. 17. Entry 17008 has line number 1, so it refers to line 16001 in FIG. 16. Entry 17009 has line number 2, so it refers to line 16002. Entry 17010 has line number 3, so it refers to line 16003. These three lines are therefore recognized as the page header and copied to the page header storage area.
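
The header scan of step 5001 can be sketched as follows. This is an illustrative sketch under assumptions: `detect_header` is a hypothetical name, and `results` holds the triples (already-input line number, line offset, matching score) accumulated earlier.

```python
def detect_header(results, page_length):
    """Return the 1-based numbers of the lines recognized as page header.

    A line qualifies when some result entry has that line number and a
    line offset equal to the page length. Only the unbroken run of
    qualifying lines starting at line 1 is kept.
    """
    matched = {number for number, offset, _score in results
               if offset == page_length}
    header = []
    number = 1
    while number in matched:
        header.append(number)
        number += 1
    return header
```

With entries for lines 1, 2, 3, 21, 22, and 23 at offset 23, the scan stops at line 4 and returns `[1, 2, 3]`; the entries for lines 21 to 23 are ignored because they are not contiguous with the top of the page.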

[0082] Next, in the same step 5001, each line of the page header is compared with the line that lies below it by 23 lines, the page length. That is, line 16001 is compared with line 16024, line 16002 with line 16025, and line 16003 with line 16026. As a result, column 41 differs between line 16002 and line 16025, and the character in that column is a digit, so the column is recognized as a page number and the character in that column of the page header is changed to '$'.
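
The page number masking described above can be sketched as a column-wise comparison. This is a minimal sketch, not the patent's code. The name `mask_page_numbers` is hypothetical, and the sketch assumes that only a digit in the header line itself (not in the compared line) triggers the replacement.

```python
def mask_page_numbers(header_line: str, next_page_line: str, mark: str = "$") -> str:
    """Replace differing digit columns of header_line with mark.

    next_page_line is the line located one page length further down.
    """
    masked = list(header_line)
    for col, ch in enumerate(header_line):
        # Columns beyond the end of the other line count as different.
        other = next_page_line[col] if col < len(next_page_line) else ""
        if ch != other and ch.isdigit():
            masked[col] = mark
    return "".join(masked)
```

For instance, comparing `"Report  page 1\n"` with `"Report  page 2\n"` yields `"Report  page $\n"`, which can later serve as a template when page headers are regenerated.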

[0083] In the following step 5002, the line number of the line that follows the lines recognized as the page header in the already-input line group is set as the text upper limit number. At this point, the lines recognized as the page header are lines 16001, 16002, and 16003, so line number 4 of the following line 16004 is set as the text upper limit number.

[0084] This completes the operation of the page header recognition step 1002 of FIG. 1.

[0085] Next, the process proceeds to the page footer recognition step 1003. The details of this step are described with reference to FIG. 6. The operation of this page footer recognition step is almost the same as the procedure of the page header recognition step 1002 of FIG. 5.

[0086] First, in step 6001, the line of the already-input line group whose line number equals the page length is taken as the last line of the page. At this point, line 16023 is the last line of the page. Next, the lines are examined upward from the last line of the page. A line qualifies if the line matching score calculation result (FIG. 17) contains an entry whose already-input line number equals the number of that line and whose line offset equals the page length. The range over which such lines continue without interruption from the last line of the page is detected. The detected lines are recognized as the page footer and copied to the page footer storage area. This copied data is hereinafter simply called the page footer.

[0087] At this point, for the lines that continue upward from the last line of the page, the entries of the line matching score calculation result whose line offset equals the page length of 23 are 17011, 17012, and 17013 in FIG. 17. Entry 17011 has line number 21, so it refers to line 16021. Entry 17012 has line number 22, so it refers to line 16022. Entry 17013 has line number 23, so it refers to line 16023. These three lines are therefore recognized as the page footer and copied to the page footer storage area.

[0088] Next, in the same step 6001, each line recognized as part of the page footer is compared with the line that lies below it by a number of lines equal to the page length. If a column in which the characters differ holds a digit, that column is recognized as a page number, and the character in that column of the page footer is replaced with '$'.

[0089] At this point, each line of the page footer is compared with the line that lies below it by 23 lines, the page length. That is, line 16021 is compared with line 16044, line 16022 with line 16045, and line 16023 with line 16046. As a result, column 22 differs between line 16022 and line 16045, and the character in that column is a digit, so the column is recognized as a page number and the character in that column of the page footer is changed to '$'.

[0090] In the following step 6002, the line number of the line immediately before the lines recognized as the page footer in the already-input line group is set as the text lower limit number. At this point, the lines recognized as the page footer are lines 16021, 16022, and 16023, so line number 20 of the immediately preceding line 16020 is set as the text lower limit number.
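
The footer scan of step 6001 and the text lower limit of step 6002 mirror the header logic, scanning upward from the last line of the first page. The sketch below is illustrative only. The name `detect_footer` is hypothetical, and returning the page length as the lower limit when no footer is found is an assumption that the description does not address.

```python
def detect_footer(results, page_length):
    """Return (footer line numbers, text lower limit number).

    results holds triples (already-input line number, line offset, score).
    """
    matched = {number for number, offset, _score in results
               if offset == page_length}
    footer = []
    number = page_length            # last line of the first page
    while number >= 1 and number in matched:
        footer.append(number)
        number -= 1
    footer.reverse()                # report the lines top to bottom
    # ASSUMPTION: without a footer, the text extends to the page's last line.
    text_lower_limit = footer[0] - 1 if footer else page_length
    return footer, text_lower_limit
```

With matching entries for lines 21 to 23 at offset 23, the result is `([21, 22, 23], 20)`, in line with the walkthrough above.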

[0091] This completes the operation of the page footer recognition step 1003.

[0092] FIG. 19 shows the results of the estimation and recognition of the page length, page header, and page footer described above.

[0093] Next, the process proceeds to the region recognition step 1004. The details of this step are described with reference to FIGS. 7 and 8.

[0094] First, in step 7001, blank characters are counted column by column (that is, vertically in FIG. 16) over the lines of the already-input line group buffer. If a line is too short to reach a given column, that column is treated as holding a blank character. A newline character is regarded as a blank character. Tab characters are assumed to have been expanded into the required number of blank characters. Next, in step 7002, the number of blank characters in each column is divided by the number of lines read, and the result is taken as the blank character rate of that column.
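
The per-column blank character rate of steps 7001 and 7002 can be sketched as follows. This is a hedged sketch: the name `blank_rates` and the tab size of 8 are assumptions, and the lines are expected to include their newline characters.

```python
def blank_rates(lines, tab_size=8):
    """Return the blank character rate of each column (index 0 = column 1)."""
    rows = [line.expandtabs(tab_size) for line in lines]
    if not rows:
        return []
    # The maximum line length includes the newline character.
    width = max(len(row) for row in rows)
    counts = [0] * width
    for row in rows:
        # The newline counts as a blank; short lines are padded with blanks.
        padded = row.replace("\n", " ").ljust(width)
        for col, ch in enumerate(padded):
            if ch == " ":
                counts[col] += 1
    return [count / len(rows) for count in counts]
```

Columns that are blank in nearly every line, such as a left margin or the gutter between two text columns, end up with a rate close to 1.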

[0095] FIG. 20 shows the per-column blank character rates produced by steps 7001 and 7002 at this point, sorted in descending order of blank character rate.

[0096] Next, in step 7003, each range of consecutive columns whose blank character rate is at or above a fixed value is recognized as a boundary between column blocks, and the start column and end column of each column block are determined. The result is stored as the basic column block recognition result.

[0097] In this embodiment, the fixed value for the blank character rate is 0.75. At this point, the runs of consecutive columns at or above that value are column 53, shown at 20001 in FIG. 20; columns 1 to 2, shown at 20002 to 20003; and columns 25 to 26, shown at 20004 to 20005. These are recognized as the boundaries of the column block regions. There are therefore two column blocks: columns 3 to 24 and columns 27 to 52. FIG. 21 shows the basic column block region recognition result obtained in this way.
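
Deriving the column blocks from the blank character rates amounts to finding the runs of columns that lie below the threshold. The following is a minimal sketch, assuming a list of rates indexed from column 1; the name `column_blocks` is hypothetical.

```python
def column_blocks(rates, threshold=0.75):
    """Return (start column, end column) pairs, 1-based and inclusive.

    Columns whose blank rate is at or above the threshold are treated
    as boundaries; every run of the remaining columns is one block.
    """
    blocks = []
    start = None
    for col, rate in enumerate(rates, start=1):
        if rate < threshold:
            if start is None:
                start = col            # a new block begins here
        elif start is not None:
            blocks.append((start, col - 1))
            start = None
    if start is not None:              # block runs to the last column
        blocks.append((start, len(rates)))
    return blocks
```

Feeding in 53 columns where only columns 1, 2, 25, 26, and 53 reach the threshold reproduces the two blocks of the example, columns 3 to 24 and columns 27 to 52.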

[0098] Next, in step 8001 of FIG. 8, the input file is reopened so that it can be read again from the beginning. Then, in step 8002, the value of the page number counter is set to 0.

[0099] After decision step 8003, 1 is added to the value of the page number counter in step 8004. At this point, the value of the page number counter becomes 1.

[0100] Next, in step 8005, the number of lines given by the page length is read from the input file. In this example, lines 16001 to 16023 are read. In the following step 8006, this one page of data is analyzed to recognize the regions within the page.

[0101] In step 8006, based on the page header, the page footer, the text upper limit line number, the text lower limit line number, and the basic column block region recognition result, the range in which text and charts exist is assumed, as are the column ranges of the column block boundaries. The regions recognized within the page are then stored in the individual page region recognition result. Breaks between regions are recognized according to (i) and (ii) below. A newline character is treated as a blank character.

[0102] (i) If, in some line, every character within the column range of one column block that is assumed to exist (that is, the range where characters would be expected if the line contained the same column block as the preceding line) is a blank character, that line is a break between regions. In other words, if a region exists immediately before the line, that region ends just before the line, and another region starts just after the line.

[0103] (ii) If, in some line, a column assumed to be a column block boundary holds a non-blank character, that line belongs to a region different from those in the basic column block region recognition result. In other words, if regions adjoin each other across that boundary, those regions end just before the line, and a region whose width combines the adjoining regions is assumed to start at that line.
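
Rule (i) can be illustrated for a single column block. The sketch below covers only rule (i); the merging of adjacent column blocks under rule (ii) is deliberately left out. The name `split_regions` is hypothetical, and tabs are assumed to have been expanded beforehand.

```python
def split_regions(lines, start_col, end_col):
    """Split one column block into regions at all-blank lines (rule (i)).

    lines: the lines of one page, each including its newline.
    start_col, end_col: 1-based inclusive column range of the block.
    Returns (first line, last line) pairs, 1-based and inclusive.
    """
    regions = []
    begin = None
    for number, line in enumerate(lines, start=1):
        cell = line.rstrip("\n")[start_col - 1:end_col]
        if cell.strip(" ") == "":
            # Blank within the block: any open region ends on the line above.
            if begin is not None:
                regions.append((begin, number - 1))
                begin = None
        elif begin is None:
            begin = number
    if begin is not None:
        regions.append((begin, len(lines)))
    return regions
```

Running this once per column block gives the candidate regions of a page. A fuller version would also check the boundary columns on each line and merge neighbouring blocks as rule (ii) describes.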

[0104] The region recognition result at this point, that is, for the first page, consists of entries 22009, 22010, and 22011 of the individual page region recognition result in FIG. 22. A graphical representation of this result is shown at 23001 in FIG. 23. In the individual page region recognition result of FIG. 22, the start line 22001, end line 22002, start column 22004, and end column 22005 indicate where each region lies in the input file shown in FIG. 16. The page 22003 of each entry is the value of the page number at this point. The region type 22006, next region 22007, and next connection type 22008 are filled in by later processing, so at this point they are empty or hold '―', which indicates the end.

[0105] After step 8006 ends, the process again passes through decision step 8003 and proceeds to step 8004.

[0106] In step 8004, 1 is added to the value of the page number counter. At this point, its value becomes 2.

[0107] Next, in step 8005, the number of lines given by the page length, that is, 23 lines of data, is read from the input file. In this example, lines 16024 to 16046 are read. In the next step 8006, the data of this second page is analyzed to recognize its regions. The result of this step is shown at 22012, 22013, 22014, and 22015 in FIG. 22. A graphical representation of this result is shown at 23002 in FIG. 23.

[0108] After step 8006 ends, the process returns to decision step 8003. This time the end of the input file has already been reached, so the operation of the flowchart of FIG. 8, that is, of the region recognition step 1004, ends.

[0109] Next, the process proceeds to the region type recognition step 1005. The details of this step are described with reference to FIG. 9.

[0110] First, in step 9001, for each region recognized as in FIGS. 7 and 8, the ratio of the number of characters that make up charts, such as '+', '―', '|', and '‖', to the number of characters other than blank characters and newline characters is examined. If the ratio is at or above a fixed value, the region is classified as a chart region. Otherwise, it is classified as a text region.

[0111] In this embodiment, the threshold used for the region type in this step is 0.6. In the current input file, the region indicated by 22011 in FIG. 22, that is, columns 3 to 52 of lines 16016 to 16020 in FIG. 16, contains 134 chart-forming characters and 155 characters excluding blank characters and newline characters. The ratio is therefore 0.86, and the region is classified as a chart region. The other regions contain none of the chart-forming characters, so they are classified as text regions.
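
The chart test of step 9001 can be sketched as a simple character ratio. This is an illustrative sketch under assumptions: `region_type` is a hypothetical name, and the default set of chart-forming characters (`+`, `-`, `|`) is an ASCII stand-in for the characters listed in the description.

```python
def region_type(region_lines, threshold=0.6, chart_chars=frozenset("+-|")):
    """Classify a region as 'chart' or 'text'.

    region_lines: the text of the region, one string per line.
    The ratio is chart-forming characters over all characters other
    than blanks and newlines.
    """
    visible = [ch for line in region_lines for ch in line if ch not in " \n"]
    if not visible:
        return "text"   # assumption: an empty region defaults to text
    chart_count = sum(1 for ch in visible if ch in chart_chars)
    return "chart" if chart_count / len(visible) >= threshold else "text"
```

In the worked example, 134 chart-forming characters out of 155 visible characters give a ratio of about 0.86, which exceeds the 0.6 threshold.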

[0112] FIG. 24 shows the result of recognizing the region types. In the region type column 24001, only the region indicated by 24002 is marked "chart", which shows that it is a chart region. This is the region indicated by 22011 in FIG. 22. The region type column 24001 of the other regions is marked "text", which shows that they are text regions.

[0113] This completes the operation of the region type recognition step 1005.

[0114] Next, the process proceeds to the region connection order determination step 1006. The details of this step are described with reference to FIGS. 10 and 11.

[0115] First, in step 10001, for each page of the input file, where a region in the page has regions adjacent to it above or below, those regions are connected from top to bottom. Where regions are adjacent on the left and right, the bottommost connected region on the left side is connected to the topmost region on the right. Connections are decided from top to bottom and from left to right. A region that has already been connected is not connected again but is skipped, so that a single linear connection order is determined for the regions of the page.

[0116] In the next step 10002, the regions are connected across the pages of the input file: the last region in the connection order of one page is connected to the leftmost, uppermost region of the next page.

[0117] FIG. 25 shows the recognition result of steps 10001 and 10002 for the current input file (FIG. 16). The numbers in the next region column 25002 are values of the region number in column 25001. In this way, the connection order of the regions in the input file is determined as a single sequence. The '―' at 25003 marks the end of the connection order at this point.

[0118] FIG. 26 shows a graphical representation of the connection order of FIG. 25.

[0119] Next, the process proceeds to step 11001 in FIG. 11. In step 11001, the single connection order of the regions is traced, and a separate connection order is created for each region type. That is, two sequences are created: one for text regions and one for chart regions. Each region in the sequence of text regions is called a text main region, and the head of that connection order is set as the head region of the text main regions. Each region in the sequence of chart regions is called a chart sub-region, and the head of that connection order is set as the head region of the chart sub-regions. Within each sequence, the regions keep the order they had in the original single connection order.
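
Splitting the single connection order into a text sequence and a chart sequence, as in step 11001, is a stable partition. The sketch below is illustrative. The names `split_by_type` and `to_links`, and the use of `None` as the end marker, are assumptions.

```python
def split_by_type(order, kinds):
    """Partition a connection order by region type, preserving order.

    order: region numbers in the single connection order.
    kinds: mapping from region number to 'text' or 'chart'.
    Returns (text main regions, chart sub-regions).
    """
    chains = {"text": [], "chart": []}
    for region in order:
        chains[kinds[region]].append(region)
    return chains["text"], chains["chart"]


def to_links(chain):
    """Turn a sequence into a 'next region' table; None marks the end."""
    return {cur: nxt for cur, nxt in zip(chain, chain[1:] + [None])}
```

The first element of each returned list plays the role of the head region, and `to_links` produces the kind of next-region column shown in the figures.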

[0120] FIG. 27 and the next region column 28001 of FIG. 28 show the result of step 11001 for the current input file. FIG. 29 shows a graphical representation of this result.

【０１２１】In the next step 11002, when a text area is directly followed by another text area, the process checks whether all of the following hold: the word frequency distributions of the two areas differ greatly; the words at the boundary between the two areas do not connect properly; and, in an adjacent column area located after the following text area, there is a text area whose word frequency distribution is similar to that of the first text area and whose boundary words connect properly to it. If such an area exists, the connection order is changed so that it is connected immediately after the first text area.

【０１２２】For the current input file, the words in each area are not taken into consideration, so a concrete example of this step is omitted. If this step did have an effect on the current input file, it would be, for example, to change the connection from the fourth area 29004 to the sixth area 29006 in FIG. 29 into a connection from the fourth area 29004 to the fifth area 29005.
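The text does not say how the word frequency distributions of two areas are compared. As one possible reading, the sketch below uses cosine similarity over whitespace-separated word counts; the measure, the tokenization, and the function name `word_similarity` are all assumptions introduced here, and the boundary-word check of step 11002 is not modeled.

```python
import math
from collections import Counter


def word_similarity(text_a, text_b):
    """Cosine similarity of word-count vectors (an assumed measure).
    Returns a value in [0, 1]; 0.0 when either text has no words."""
    ca = Counter(text_a.lower().split())
    cb = Counter(text_b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this reading, a low score between an area and its successor, together with a high score between the area and a later candidate, would be one signal for reordering.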

【０１２３】In the next step 11003, when areas in the same page are connected vertically according to the basic column area recognition result, the next-connection type of the upper area is set to "same column". Otherwise, the next-connection type is set to "different column".

【０１２４】The next-connection type column 28002 in FIG. 28 shows the result of step 11003 for the current input file. The connection 28003 from the fourth area to the sixth area and the connection 28004 from the fifth area to the seventh area are of the "same column" type.

【０１２５】This completes the operation of the area connection order determination step 1006.

【０１２６】Next, the process proceeds to the text/chart extraction step 1007. The details of this step are described with reference to FIG. 12. First, in step 12001, the list of areas indicated by the text main area is traversed, and the data extracted from each area are concatenated to create a text file. However, when the next-connection type is "same column", the areas are not concatenated; the following area starts a separate text file.
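The grouping rule of step 12001 can be sketched as below. The representation is an assumption made for illustration: each area is a pair of its extracted text and its next-connection type, written here as "same", "different", or `None` at the end of the list.

```python
def group_into_files(areas):
    """Concatenate area texts in connection order, starting a new file
    after any area whose next-connection type is 'same' (same column)."""
    files, current = [], []
    for text, next_type in areas:
        current.append(text)
        if next_type in ("same", None):
            # A same-column connection or the end of the list closes the file.
            files.append("".join(current))
            current = []
    if current:  # list ended without an explicit end marker
        files.append("".join(current))
    return files


# Area texts are placeholders named after the area numbers in the example.
areas = [("A1", "different"), ("A2", "different"), ("A4", "same"),
         ("A6", "different"), ("A5", "same"), ("A7", None)]
files = group_into_files(areas)
```

With the connection types described for the current input file, this yields three files holding areas 1, 2, 4, then 6, 5, then 7.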

【０１２７】For the current input file, as shown in FIGS. 27, 28, and 29, the first area 29001, the second area 29002, and the fourth area 29004 are connected with the next-connection type "different column", so data are extracted from these areas and concatenated to create a text file. The result is shown in FIG. 30. The first area 29001 corresponds to 30001, the second area 29002 to 30002, and the fourth area 29004 to 30003.

【０１２８】The connection from the fourth area 29004 to the sixth area 29006 is of the "same column" type, as shown at 28003 in FIG. 28, so the data from the sixth area onward are stored in a separate file. The sixth area 29006 and the fifth area 29005 are connected with the next-connection type "different column", so data are extracted from these two areas and concatenated to create a text file. The result is shown in FIG. 31. The sixth area 29006 corresponds to 31001, and the fifth area 29005 to 31002.

【０１２９】The connection from the fifth area to the seventh area is of the "same column" type, so the data from the seventh area 29007 onward are stored in a separate file. Because the seventh area is followed by the end of the list, data are extracted from the seventh area alone and stored in a file. The result is shown in FIG. 32.

【０１３０】In the next step 12002, the list of chart sub-areas is traversed, and data are extracted from each area to create chart files. The data extracted from each area in the list are stored in separate chart files. In the current input file, the chart sub-area list points to the third area 29003, which is followed by the end of the list, so data are extracted from this area and stored in a chart file. The result is shown in FIG. 33.

【０１３１】This completes the operation of the text/chart extraction step 1007.

【０１３２】Next, the process proceeds to the translation-unnecessary portion recognition step 1008. The details of this step are described with reference to FIG. 13.

【０１３３】First, in step 13001, each line of each text file extracted from an area whose type is text area is examined. The process counts the characters that make up charts, such as '+', '-', '|', and '‖', and the characters that make up mathematical expressions, such as equals signs, inequality signs, '/', '*', 'Σ', '±', '÷', '×', and digits. A line in which the ratio of this count to the number of characters in the line, excluding blank and newline characters, is at or above a fixed value is recognized as a translation-unnecessary line, and a translation-unnecessary designation is inserted into that line.

【０１３４】For the current input file, three text files were extracted from the text areas (FIGS. 30, 31, and 32), but only the text file of FIG. 30 is assumed to change as a result of this step. The result of step 13001 is shown in FIG. 34. The lines indicated by 34001, 34002, 34003, and 34004 are those recognized as translation-unnecessary lines, and the translation-unnecessary designations '<<' and '>>' have been inserted.
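The ratio test of step 13001 can be sketched as follows. The character sets follow the examples listed in the text, but the threshold value of 0.5, the function name `mark_line`, and the choice to wrap the whole line in '<<' and '>>' are assumptions; the text states only that a fixed ratio is used and that these designations are inserted.

```python
CHART_CHARS = set("+-|‖")
MATH_CHARS = set("=<>/*Σ±÷×") | set("0123456789")
SPECIAL = CHART_CHARS | MATH_CHARS


def mark_line(line, threshold=0.5):
    """Wrap the line in '<<' and '>>' when chart and math characters make up
    at least `threshold` of its non-whitespace characters.
    `line` is assumed to carry no trailing newline."""
    body = [c for c in line if not c.isspace()]
    if not body:
        return line  # blank line: nothing to judge
    ratio = sum(c in SPECIAL for c in body) / len(body)
    return "<<" + line + ">>" if ratio >= threshold else line
```

For step 13002, which applies to chart files, the same helper could be called with a threshold just above zero so that any line containing at least one such character is marked; that reuse is likewise an assumption.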

【０１３５】In the next step 13002, each line of each text file extracted from an area whose type is chart area is examined. A line that contains any character that makes up charts, such as '+', '-', '|', and '‖', or any character that makes up mathematical expressions, such as equals signs, inequality signs, '/', '*', 'Σ', '±', '÷', '×', and digits, is recognized as a translation-unnecessary line, and a translation-unnecessary designation is inserted into that line. In the current input file, the only chart file extracted from a chart area is that of FIG. 33, and the result is shown in FIG. 35.

【０１３６】This completes the operation of the translation-unnecessary portion recognition step 1008.

【０１３７】Next, the process proceeds to the translation step 1009. In this step, the text files shown in FIGS. 34, 31, and 32 and the chart file shown in FIG. 35 are each machine translated, and the results are stored in separate files. The translation in this step uses a known translation method, so a detailed description is omitted.

【０１３８】Next, the process proceeds to the text format generation step 1010. The details of this step are described with reference to FIG. 14.

【０１３９】First, in step 14001, for the individual page area recognition result (FIG. 28), the list of text areas indicated by the text main area is traversed, and areas connected with the next-connection type "same column" are merged into one area. The result is taken as the target document area generation result. This step changes FIG. 28 into FIG. 36. The values in the end-column field 36003 of the fourth area 36001 and the fifth area 36002 are changed, and the sixth area 28005 and the seventh area 28006, which have been merged into them, are deleted. FIG. 37 is a graphical depiction of the contents of FIG. 36.

【０１４０】Next, in step 14002, the translation result of each chart file is embedded in the chart area, indicated by the chart sub-area, from which its data were extracted. If the translation result does not fit in the destination area, the destination area is expanded and the other areas in the same page are reduced accordingly. For the current input file, the chart file of FIG. 35, which is the extraction result of the third area, is embedded. For the current input file, it is assumed that no expansion of the area is needed for this embedding.

【０１４１】Next, in step 14003, the translation results of the text files are concatenated with a blank line inserted at each file boundary and are embedded, in order, in the areas indicated by the text main area. If area space remains after all the translation results have been embedded, it is filled with newline or blank characters. If the text areas are insufficient, new pages and column areas are generated in accordance with the results of page length estimation, page header recognition, page footer recognition, and basic column area recognition. If a page header or page footer contains the page number designation '$', the designation on each generated page is replaced with the appropriate page number.
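The filling rule of step 14003 can be sketched as below. This is a simplified illustration under stated assumptions: each translated file is a list of lines, each area is described only by its capacity in lines, and wrapping lines to the area width is not modeled. The name `fill_areas` is hypothetical.

```python
def fill_areas(translated_files, capacities):
    """Join files with one blank line at each boundary, pour the lines into
    areas of the given line capacities, and pad unused space with blank lines.
    Returns (areas, leftover); a non-empty leftover means more areas are needed."""
    lines = []
    for i, file_lines in enumerate(translated_files):
        if i > 0:
            lines.append("")  # blank line at the file boundary
        lines.extend(file_lines)
    areas, pos = [], 0
    for cap in capacities:
        chunk = lines[pos:pos + cap]
        pos += len(chunk)
        areas.append(chunk + [""] * (cap - len(chunk)))
    return areas, lines[pos:]
```

A non-empty leftover corresponds to the case in which a new page and new column areas would have to be generated from the recognized page length, header, footer, and column layout.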

【０１４２】FIGS. 38 and 39 show the results of these steps for the current input file (FIG. 16). This is an example in which a new page was generated, as shown by the third page 39001.

【０１４３】Next, in step 14004, any page in which all areas are filled only with newline or blank characters is deleted. In the current example, as shown in FIGS. 38 and 39, there is no such page, so this step makes no change.
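Step 14004 amounts to a simple filter. In the sketch below, a page is modeled as a list of area strings; that representation and the function name `drop_blank_pages` are assumptions made for illustration.

```python
def drop_blank_pages(pages):
    """Keep only pages in which at least one area contains a character
    other than a blank or newline character."""
    return [
        page for page in pages
        if any(not ch.isspace() for area in page for ch in area)
    ]
```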

【０１４４】This completes the operation of the text format generation step 1010, and with it the operation of the text format recognition generation method (and machine translation process) of FIG. 1.

【０１４５】This concludes the description of one embodiment of the present invention.

【０１４６】Next, modifications of the above embodiment are described. In the above embodiment, the condition for an area break during area recognition was that the range assumed to contain an area consists entirely of blank characters. In addition to this, a further condition can be added: when, in a given line, the left side of the range assumed to contain an area consists of a fixed number or more of blank characters, an area break is placed immediately before that line. In other words, when a paragraph begins with an indentation, the area can be divided at the line immediately before it.

【０１４７】Next, in the above embodiment, column areas were recognized by reading part or all of the input file, counting the blank characters in each character column to compute the blank character ratio, and recognizing a range of consecutive character columns with a high ratio as a boundary between column areas. Instead, for a specific range of consecutive lines, the frequency of each character in each character column may be computed, and a character column in which the same character, even a non-blank one, occurs at a high rate across those lines may be recognized as a boundary between column areas.

【０１４８】As a result, for example, in e-mail, where a quoted portion is sometimes marked by placing '>' at the left end of each line, the quoted portion can be recognized as an area. In particular, if the characters that form the lines of charts are treated as characters that can form column area boundaries, the interior of table cells, such as lines 16016 to 16020 of the input file (FIG. 16) of the above embodiment, can be recognized as areas. The structure of the table can therefore be preserved while its contents are translated.
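The per-column character frequency idea in this modification can be sketched as follows. The 0.9 rate, the treatment of short lines as padded with blanks, and the function name `dominant_columns` are assumptions; the text says only that a character column in which the same character occurs at a high rate may be treated as a boundary.

```python
from collections import Counter


def dominant_columns(lines, min_rate=0.9):
    """For each character column, find the most frequent character across
    the given lines and keep the column if that character's rate is at
    least `min_rate`. Short lines are treated as padded with blanks."""
    width = max((len(line) for line in lines), default=0)
    result = {}
    for col in range(width):
        counts = Counter(line[col] if col < len(line) else " " for line in lines)
        char, n = counts.most_common(1)[0]
        if n / len(lines) >= min_rate:
            result[col] = char
    return result


quoted = ["> hello", "> world", "> again"]
cols = dominant_columns(quoted)
```

For the quoted e-mail lines above, column 0 is dominated by '>' and column 1 by a blank, so both would be candidates for an area boundary under this sketch.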

【０１４９】Next, in the above embodiment, when a line has a character other than a blank or newline character in a character column assumed to be a column area boundary, an area whose width spans the two areas adjacent across that boundary was assumed to begin. Instead, the column area boundaries may be recognized by computing the blank character ratio for each character column over a specific range of lines starting from that line. This allows text with a more complex area layout to be recognized.

【０１５０】Also, in the above embodiment, the text format of the input file is recognized, and after translation a translation result text with the same text format is generated. Instead, after the text format is recognized, the recognition result may be inserted, as data with tags (typesetting commands), into the result of concatenating the extracted text files. This makes it easier to transform and process the text format.

【０１５１】[0151]

【発明の効果】[Effects of the Invention] As described above, according to the present invention, when page-formatted document data are machine translated, for example, the text format is recognized from the input document and the original page format is applied to the translated document, so that a translated document with a text format equivalent to that of the input document can be generated. This has the effect of automating tedious work that was previously done by hand. The invention is effective not only when the input file is data created with a word processor, text formatter, or the like, but also when the input file is the result of recognizing printed matter with an optical character recognition device or the like.

[Brief description of drawings]

【図１】FIG. 1 is a processing flowchart of the text format recognition generation method of the present invention.

【図２】FIG. 2 is a part of a processing flowchart of the page length estimation method, which is a part of the text format recognition generation method of the present invention.

【図３】FIG. 3 is a part of a processing flowchart of the page length estimation method, which is a part of the text format recognition generation method of the present invention.

【図４】FIG. 4 is a part of a processing flowchart of the page length estimation method, which is a part of the text format recognition generation method of the present invention.

【図５】FIG. 5 is a processing flowchart of the page header recognition method, which is a part of the text format recognition generation method of the present invention.

【図６】FIG. 6 is a processing flowchart of the page footer recognition method, which is a part of the text format recognition generation method of the present invention.

【図７】FIG. 7 is a part of a processing flowchart of the area recognition method, which is a part of the text format recognition generation method of the present invention.

【図８】FIG. 8 is a part of a processing flowchart of the area recognition method, which is a part of the text format recognition generation method of the present invention.

【図９】FIG. 9 is a processing flowchart of the area type recognition method, which is a part of the text format recognition generation method of the present invention.

【図１０】FIG. 10 is a part of a processing flowchart of the area connection order recognition method, which is a part of the text format recognition generation method of the present invention.

【図１１】FIG. 11 is a part of a processing flowchart of the area connection order recognition method, which is a part of the text format recognition generation method of the present invention.

【図１２】FIG. 12 is a processing flowchart of the text/chart extraction method, which is a part of the text format recognition generation method of the present invention.

【図１３】FIG. 13 is a processing flowchart of the translation-unnecessary portion recognition method, which is a part of the text format recognition generation method of the present invention.

【図１４】FIG. 14 is a processing flowchart of the text format generation method, which is a part of the text format recognition generation method of the present invention.

【図１５】FIG. 15 is a diagram showing an example of an apparatus on which the text format recognition generation method of the present invention operates.

【図１６】FIG. 16 is a diagram showing the contents of an input file used to explain the operation of the text format recognition generation method of the present invention.

【図１７】FIG. 17 is a diagram for explaining the operation of the page length estimation method, which is a part of the text format recognition generation method of the present invention.

【図１８】FIG. 18 is a diagram for explaining the operation of the page length estimation method, which is a part of the text format recognition generation method of the present invention.

【図１９】FIG. 19 is a diagram showing the operation results of the page length estimation method, the page header recognition method, and the page footer recognition method, which are parts of the text format recognition generation method of the present invention.

【図２０】FIG. 20 is a diagram for explaining the operation of the area recognition method, which is a part of the text format recognition generation method of the present invention.

【図２１】FIG. 21 is a diagram for explaining the operation of the area recognition method, which is a part of the text format recognition generation method of the present invention.

【図２２】FIG. 22 is a diagram showing the operation result of the area recognition method, which is a part of the text format recognition generation method of the present invention.

【図２３】FIG. 23 is a graphical depiction of the contents of FIG. 22.

【図２４】FIG. 24 is a diagram showing the operation result of the area type recognition method, which is a part of the text format recognition generation method of the present invention.

【図２５】FIG. 25 is a diagram for explaining the operation of the area connection order recognition method, which is a part of the text format recognition generation method of the present invention.

【図２６】FIG. 26 is a graphical depiction of the contents of FIG. 25.

【図２７】FIG. 27 is a part of a diagram showing the operation result of the area connection order recognition method, which is a part of the text format recognition generation method of the present invention.

【図２８】FIG. 28 is a part of a diagram showing the operation result of the area connection order recognition method, which is a part of the text format recognition generation method of the present invention.

【図２９】FIG. 29 is a graphical depiction of the contents of FIG. 28.

【図３０】FIG. 30 is a diagram showing a processing result of the text/chart extraction method, which is a part of the text format recognition generation method of the present invention.

【図３１】FIG. 31 is a diagram showing a processing result of the text/chart extraction method, which is a part of the text format recognition generation method of the present invention.

【図３２】FIG. 32 is a diagram showing a processing result of the text/chart extraction method, which is a part of the text format recognition generation method of the present invention.

【図３３】FIG. 33 is a diagram showing a processing result of the text/chart extraction method, which is a part of the text format recognition generation method of the present invention.

【図３４】FIG. 34 is a diagram showing a processing result of the translation-unnecessary portion recognition method, which is a part of the text format recognition generation method of the present invention.

【図３５】FIG. 35 is a diagram showing a processing result of the translation-unnecessary portion recognition method, which is a part of the text format recognition generation method of the present invention.

【図３６】FIG. 36 is a diagram showing a processing result of the text format generation method, which is a part of the text format recognition generation method of the present invention.

【図３７】FIG. 37 is a graphical depiction of the contents of FIG. 36.

【図３８】FIG. 38 is a diagram showing a processing result of the text format recognition generation method of the present invention.

【図３９】FIG. 39 is a diagram showing a processing result of the text format recognition generation method of the present invention.

[Explanation of symbols]

15001: input device, 15002: file storage device, 15003: system device, 15004: translation device, 15005: display device.

Claims

[Claims]

1. A text format recognition generation method comprising: a step of inputting a text file to which text formatting such as page headers, page footers, column layout, and chart placement has been applied; a step of estimating the page length of the input text file; a step of recognizing page footers and/or page headers; a step of recognizing areas such as columns and charts; a step of recognizing the type of each recognized area, indicating whether it is a text area or a chart area; a step of determining the connection order of the recognized areas; a step of extracting, in accordance with the connection order, text and charts that span a plurality of areas; a step of applying a predetermined conversion to the extracted text and charts; and a step of giving the converted text and charts the same format as the input text file.

2. The text format recognition generation method according to claim 1, wherein the predetermined conversion is a translation process.

3. The text format recognition generation method according to claim 2, further comprising a step of recognizing portions of the extracted text and charts that do not require translation and adding a translation-unnecessary designation to those portions before translation.

4. The text format recognition generation method according to claim 1, wherein: some or all lines are read from the beginning of the input text file, and line numbers are assigned to the lines from the beginning; for each line, a degree of match is calculated between that line and each line closer to the end of the file, the difference between the line numbers of the two lines being taken as the line offset; for each pair of lines whose degree of match is at or above a fixed value, the pair of the degree of match and the line offset is accumulated, the accumulated result being the line match calculation result; the frequency of each line offset in the line match calculation result is counted; for each line offset, the ratio of its frequency to the number of pages computed from the number of lines read, on the assumption that the line offset is the page length, is calculated; and the line offset with the highest ratio is estimated to be the page length.

5. The text format recognition generation method according to claim 4, wherein it is detected from the line match calculation result whether lines whose line offset equals the page length occur consecutively from the first line of a page toward the end of the file, and such lines are recognized as a page header.

6. The text format recognition generation method according to claim 5, wherein, for each line recognized as a page header, the corresponding lines of the pages are compared, and when a character column in which the characters do not match contains a digit, that character column is recognized as a page number field.

7. The text format recognition generation method according to claim 4, wherein it is detected from the line match calculation result whether lines whose line offset equals the page length occur consecutively from the last line of a page toward the beginning of the file, and such lines are recognized as a page footer.

8. The text format recognition generation method according to claim 7, wherein, for each line recognized as a page footer, the corresponding lines of the pages are compared, and when a character column in which the characters do not match contains a digit, that character column is recognized as a page number field.

9. The text format recognition generation method according to claim 1, wherein: some or all lines are read from the beginning of the input text file; when a line is too short to reach a given character column, that line is assumed to have a blank character in that column, and the newline character marking the end of a line is also treated as a blank character; the number of blank characters is counted for each character column across the lines; and a range of consecutive character columns in which the ratio of the number of blank characters to the number of lines read is at or above a fixed value is assumed to be a column boundary.

10. The text format recognition generation method according to claim 9, wherein, when the range from one character column to another is assumed to be one column, and in some line all characters in that range are blank characters or the line is too short to reach that range, the area of the column is divided immediately before and after that line.

11. The text format recognition generation method according to claim 9, wherein, when the range from one character column to another is assumed to be one column, and in some line the characters in that range begin with a fixed number or more of blank characters, the area of the column is divided between that line and the immediately preceding line.

12. The text format recognition generation method according to claim 9, wherein, when the range from one character column to another is assumed to be one column, and in some line a character other than a blank character appears in a character column assumed to be a column boundary, any column areas adjacent across that boundary are taken to end immediately before that line, and area recognition continues on the assumption that, from that line on, there is a column area whose width is the combined width of the adjacent columns.

13. The text format recognition generation method according to claim 1, wherein an area in which the ratio of the number of characters frequently used to draw charts, such as '+', '-', '|', and '‖', to the total number of characters in that area excluding blank characters and line feed characters is at or above a certain value is recognized as a chart area, and any other area is recognized as a text area.
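A minimal sketch of the chart/text decision in claim 13 follows. The set `CHART_CHARS` contains only the four characters the claim names, and the threshold `min_chart_ratio` is a hypothetical value, since the claim specifies only "a certain value".

```python
# Only the characters named in the claim; a practical set may be larger.
CHART_CHARS = set("+-|‖")


def classify_area(area_lines, min_chart_ratio=0.5):
    """Return "chart" or "text" for an area given as a list of lines.

    min_chart_ratio is a hypothetical threshold chosen for illustration.
    """
    total = 0
    chart = 0
    for line in area_lines:
        for ch in line:
            if ch.isspace():
                # Blank characters and line feeds are excluded from the total.
                continue
            total += 1
            if ch in CHART_CHARS:
                chart += 1
    if total and chart / total >= min_chart_ratio:
        return "chart"
    return "text"
```

An area that contains no non-blank characters is reported as text here; that is a design choice of the sketch, not something the claim addresses.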

14. The text format recognition generation method according to claim 1, wherein, for the areas within a page of the input text file, vertically adjacent areas are connected from top to bottom, and, where areas are adjacent on the left and right, the bottommost area on the left side is connected to the topmost area on the right side; when these connections are determined, an area whose connection has already been determined is not connected again but is skipped; and as a result a single linear connection order is determined for the areas in the page.
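The ordering rule of claim 14 can be sketched for a simplified case. The sketch assumes that every area belongs to exactly one vertical stack, so that no area spans several stacks; under that assumption, sorting by left edge and then by top edge connects each stack from top to bottom and then passes from the bottommost area on the left to the topmost area on the right. The `Area` class and its fields are hypothetical, and the skipping of already-connected areas, which matters only when areas span several stacks, is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Area:
    name: str
    top: int   # first line occupied by the area
    left: int  # first character column occupied by the area


def connection_order(areas):
    """Order areas top to bottom within a stack, stacks left to right.

    Simplifying assumption: no area spans more than one stack.
    """
    return sorted(areas, key=lambda a: (a.left, a.top))
```

For a two-column page with two stacked areas per column, this yields upper left, lower left, upper right, lower right.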

15. The text format recognition generation method according to claim 14, wherein the last area in the connection order of a page of the input text file is connected to the uppermost area on the leftmost side of the next page.

16. The text format recognition generation method according to claim 14 or 15, wherein, after the connection order of the areas has been determined, where a chart area is connected immediately after a text area, the chart area is skipped and the text area is connected to the next area; the skipped chart areas are connected to one another in their original order; and as a result a connection order is determined separately for the sequence of text areas and for the sequence of chart areas.
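The separation described in claim 16 amounts to splitting one ordered sequence into two while preserving relative order. A minimal sketch, assuming each area is represented as a hypothetical `(name, kind)` pair with `kind` equal to `"text"` or `"chart"`:

```python
def split_text_and_chart_order(ordered_areas):
    """Split a connection order into a text sequence and a chart sequence.

    ordered_areas: list of (name, kind) pairs already in connection order.
    Each output list keeps the relative order of the input.
    """
    text_seq = [name for name, kind in ordered_areas if kind == "text"]
    chart_seq = [name for name, kind in ordered_areas if kind == "chart"]
    return text_seq, chart_seq
```

Skipping a chart area between two text areas is implicit here: the text sequence simply omits it, so the preceding text area is followed directly by the next text area.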

17. The text format recognition generation method according to claim 14, wherein, after the connection order of the areas has been determined, when another text area is connected immediately after a certain text area but the word-frequency distributions of the two text areas differ greatly and the words at the boundary between the areas do not connect properly, and when there is a suitable text area, in an adjacent layout column following the other text area, whose word-frequency distribution is similar to that of the certain text area and whose words connect properly with it at the boundary, the connection order of the areas is changed so that the suitable text area is connected immediately after the certain text area.
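Claim 17 does not say how word-frequency distributions are compared. The sketch below substitutes cosine similarity over whitespace-separated word counts, which is an assumption and not the claimed method; the function names and the `min_gain` margin are likewise hypothetical. The check that words connect properly at the area boundary is omitted.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_successor(current, default_next, candidate, min_gain=0.2):
    """Return the text that should follow `current`.

    The candidate replaces the default successor only if its similarity to
    `current` is higher by at least min_gain (a hypothetical margin).
    """
    cur = word_frequencies(current)
    sim_default = cosine_similarity(cur, word_frequencies(default_next))
    sim_candidate = cosine_similarity(cur, word_frequencies(candidate))
    if sim_candidate - sim_default >= min_gain:
        return candidate
    return default_next
```

Whitespace tokenization suits English input only; Japanese text would need a word segmenter before counting.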

18. The text format recognition generation method according to claim 1, wherein, when text is extracted from a series of areas, the text is extracted only from areas recognized as text areas, and the extracted texts are joined in accordance with the connection order.

19. The text format recognition generation method according to claim 3, wherein a line in which the ratio of the number of characters frequently used to write mathematical expressions, such as '+', '-', 'x', '÷', '^', '±', equal signs, inequality signs, and 'Σ', and characters frequently used to draw charts, such as '|' and '‖', to the total number of characters in the line excluding blank characters and line feed characters is at or above a certain value is determined to be a translation-unnecessary line.

20. The text format recognition generation method according to claim 19, wherein the certain value used as the threshold is changed according to the type of the area to which the line belongs.
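Claims 19 and 20 together describe a per-line symbol ratio test with a threshold that depends on the area type. A minimal sketch follows; the threshold values in `THRESHOLDS` are hypothetical, and the claim's 'x' is assumed here to mean the multiplication sign '×' so that the ordinary letter x is not counted.

```python
# Characters listed in claim 19. The claim's 'x' is assumed to be the
# multiplication sign; counting the letter x would misclassify ordinary words.
SYMBOL_CHARS = set("+-×÷^±=<>Σ|‖")

# Hypothetical thresholds per area type; the claims give no numbers.
THRESHOLDS = {"text": 0.5, "chart": 0.8}


def needs_no_translation(line, area_kind):
    """Return True if the line is dominated by formula or chart characters."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        # A line with no non-blank characters has nothing to classify.
        return False
    ratio = sum(ch in SYMBOL_CHARS for ch in chars) / len(chars)
    return ratio >= THRESHOLDS[area_kind]
```

With these assumed thresholds, a ruled line such as `+-----+-----+` is skipped in either area type, while a line whose symbol ratio is 0.6 is skipped in a text area but still sent for translation in a chart area.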

21. The text format recognition generation method according to claim 1, wherein, when the result of converting the text in a chart area exceeds the number of lines and character columns occupied by the original area, the number of lines and character columns of the area in which the conversion result is embedded is increased so that the conversion result can be accommodated, and, in accordance with that increase, other areas adjacent to that area are moved or reduced so that they fit on the page.

22. The text format recognition generation method according to any one of claims 1 to 3, wherein, when the result of converting the text does not fit in the area of the original formatted text, a new page is generated with a format conforming to the recognized page header, page footer, and layout columns, and the conversion result is placed in the areas of the generated page.

23. The text format recognition generation method according to claim 1, wherein, when blank space remains in an area after the result of converting the text has been placed in the area corresponding to the original area of that text, the remaining space is filled with blank characters or line feed characters.
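The padding step of claim 23 can be sketched as follows. The function name `fill_area` and its parameters are hypothetical, and the sketch assumes one character occupies one character column, which does not hold for full-width characters. An oversized result raises an error here, because enlarging the area or generating a new page is the subject of claims 21 and 22.

```python
def fill_area(converted_lines, height, width):
    """Place converted lines in a height x width area, padding with blanks.

    Assumes each character occupies exactly one character column.
    """
    too_tall = len(converted_lines) > height
    too_wide = any(len(line) > width for line in converted_lines)
    if too_tall or too_wide:
        raise ValueError("conversion result does not fit in the area")
    # Pad each line to the area width, then add blank lines up to the height.
    padded = [line.ljust(width) for line in converted_lines]
    padded += [" " * width] * (height - len(padded))
    return padded
```

Padding every line to the full width keeps any area to its right aligned when the lines of side-by-side areas are later concatenated.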

24. The text format recognition generation method according to claim 1, wherein, when a format equivalent to that of the input text file is applied to the conversion result, a page all of whose areas contain only blank characters or line feed characters is deleted.

25. The text format recognition generation method according to claim 1, wherein the recognition results of the text format, such as the page length, page footer, page header, number of layout columns, and layout-column width, and the recognition results of translation-unnecessary lines are embedded as tagged data in the text and chart data extracted from the input text file.

26. The text format recognition generation method according to claim 8, wherein, instead of calculating the ratio of the number of blank characters in each character column, the appearance rate of each character type appearing in each character column is calculated, and a range of consecutive character columns in which some character has an appearance rate at or above a certain value is assumed to be a boundary of an area.

27. The text format recognition generation method according to claim 8, 9 or 11, wherein, in addition to blank characters and line feed characters, characters frequently used for drawing charts are also regarded as indicating a boundary of an area.

28. A text format recognition generation device comprising: means for inputting a text file to which text formatting such as page headers, page footers, layout columns, and chart layout has been applied; means for estimating the page length of the input text file; means for recognizing page footers and/or page headers; means for recognizing areas such as layout columns and charts; means for recognizing the type of each recognized area, indicating whether it is a text area or a chart area; means for determining the connection order of the recognized areas; means for extracting, in accordance with the connection order, text and charts that span multiple areas; means for performing a predetermined conversion on the extracted text and charts; and means for generating, from the text and charts resulting from the conversion, a text file having a format equivalent to that of the input text file.

29. The text format recognition generation device according to claim 28, wherein the predetermined conversion is a translation process.

30. The text format recognition generation device according to claim 29, further comprising means for recognizing translation-unnecessary portions of the extracted text and charts and for adding a translation-unnecessary designation to those portions before translation.