JPH08180131A

JPH08180131A - Image processing method

Info

Publication number: JPH08180131A
Application number: JP6318285A
Authority: JP
Inventors: Tadanori Nakatsuka; 忠則中塚
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1994-12-21
Filing date: 1994-12-21
Publication date: 1996-07-12

Abstract

PURPOSE: To reduce the need for manual correction by assigning a reading order to character areas based on the continuity between them, where continuity is determined by analyzing the character images or the text in the character areas. CONSTITUTION: To check the continuity of character areas 21 and 22, the two areas are taken out, with character area 21 treated as the basic character area and character area 22 as the comparison character area. Basic character areas are chosen from the plural character areas extracted from the input image in order of proximity to a start point, which is the upper right when the document is written vertically and the upper left when it is written horizontally. Continuity is judged for every candidate for the comparison character area B. When exactly one character area is judged continuous, it is decided to be the area succeeding the basic character area; when several are judged continuous, the one with the greatest continuity is chosen as the succeeding area.

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of Industrial Application] The present invention relates to an image processing method that analyzes an input image region by region in electronic devices such as OCR (optical character recognition) devices, copying machines, facsimile machines, and DTP (desktop publishing) systems.

【０００２】[0002]

[Prior Art] A conventional way of extracting character areas from an input image and ordering them relies on the relative positions of the areas. For example, when the character areas are written vertically, they are ordered from right to left; when several character areas also exist in the vertical direction, they are ordered in the left-right direction first and then from top to bottom.

【０００３】[0003]

[Problems to Be Solved by the Invention] However, the prior art described above cannot assign the correct reading order in cases such as the following. A manuscript, such as a newspaper page, may contain multiple articles, so that adjacent character areas belong to different articles and do not continue into one another. Or, as in the example of FIG. 2, a non-character area may lie between areas 21, 22 and areas 23, 24 and act as a partition, so that the positions of the character areas alone cannot distinguish the order 21, 22, 23, 24 from the order 21, 23, 22, 24. In such cases the order determined by the processing device has to be corrected.

[0004] An object of the present invention is to assign the correct reading order, and thereby reduce the effort of correction, even when a manuscript contains multiple articles, as a newspaper page does, or when, as in the example of FIG. 2, position alone cannot distinguish the order 21, 22, 23, 24 from the order 21, 23, 22, 24.

【０００５】[0005]

[Means for Solving the Problems] To solve the above problems, the present invention provides an image processing method that stores a document image, stores area information on at least two character areas existing in the document image, and determines whether or not the text contained in two of those character areas is continuous.

[0006] To solve the above problems, the invention preferably determines whether the text is continuous by analyzing the text contained in the two character areas.

[0007] To solve the above problems, the stored area information on the character areas is preferably extracted by segmenting the stored document image into areas.

[0008] To solve the above problems, the text contained in a character area is preferably analyzed using the characters obtained by performing character recognition on the image information in that character area.

[0009] To solve the above problems, whether the text contained in the two character areas is continuous is preferably determined by a continuity index.

[0010] To solve the above problems, when an arrow is present at the end of a character area, the continuity index is preferably increased for a character area located in the direction indicated by the arrow.

[0011] To solve the above problems, when a symbol indicating the end of a sentence is present at the end of a character area, the continuity index is preferably increased for a character area that begins with an indentation.

[0012] To solve the above problems, the symbol indicating the end of a sentence is preferably a Japanese full stop (kuten).

[0013] To solve the above problems, the symbol indicating the end of a sentence is preferably a period.

[0014] To solve the above problems, the continuity index is preferably determined using the likelihood that the last sentence of one character area and the first sentence of the other character area form a single sentence.

[0015] To solve the above problems, this determination of the continuity index is preferably performed when the end of the character area is not a symbol indicating the end of a sentence.

[0016] To solve the above problems, the likelihood of forming a single sentence is preferably increased when one character area ends with a noun and the first sentence of the other character area begins with a particle.

[0017] To solve the above problems, the likelihood of forming a single sentence is preferably increased when the last sentence of one character area contains a subject but no predicate and the first sentence of the other character area contains a predicate but no subject.

[0018] To solve the above problems, the likelihood of forming a single sentence is preferably obtained using the degree of relevance between a subject contained in the one character area and a predicate contained in the other character area.

[0019] To solve the above problems, whether the text is continuous is preferably determined using the proportion of words, or their synonyms, that occur in both character areas.
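As a rough illustration of the shared-word criterion in paragraph [0019], the sketch below scores two areas by the fraction of distinct words they have in common. The function name `shared_word_ratio`, the synonym table, and the choice of dividing by the size of the union are all assumptions made for this example; the source does not specify how the proportion is computed, and it presupposes that the recognized text has already been split into words.

```python
def shared_word_ratio(words_a, words_b, synonyms=None):
    """Return the share of distinct words that occur in both areas.

    `synonyms` is a hypothetical mapping from a word to its canonical
    form, so that synonyms count as the same word.
    """
    synonyms = synonyms or {}
    set_a = {synonyms.get(w, w) for w in words_a}
    set_b = {synonyms.get(w, w) for w in words_b}
    if not set_a or not set_b:
        return 0.0
    # Assumed normalization: common words divided by all distinct words.
    return len(set_a & set_b) / len(set_a | set_b)
```

A higher ratio would suggest that the two areas belong to the same article; how the ratio is mapped onto the continuity index is left open here.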

[0020] To solve the above problems, whether the text is continuous is preferably determined using the similarity between the character areas.

[0021] To solve the above problems, the similarity between the character areas is preferably judged with respect to the style of expression of the text in those character areas.

[0022] To solve the above problems, the style of expression is preferably the level of politeness.

[0023] To solve the above problems, the style of expression is preferably the line-end expression.

[0024] To solve the above problems, the style of expression is preferably the honorific expression.

[0025] To solve the above problems, the similarity between the character areas is preferably judged with respect to the composition ratio of each character category (genre) within a character area.

[0026] To solve the above problems, the category is preferably kanji.

[0027] To solve the above problems, the category is preferably hiragana.

[0028] To solve the above problems, the category is preferably katakana.

[0029] To solve the above problems, the category is preferably symbols.

[0030] To solve the above problems, the category is preferably numerals.

[0031] To solve the above problems, the category is preferably Latin (English) letters.
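The composition-ratio criterion of paragraphs [0025] to [0031] could be sketched as follows. The six category names follow the text; the Unicode ranges used to classify characters, the NFKC folding of full-width forms, and the similarity measure (one minus half the sum of absolute differences between the two distributions) are assumptions of this example rather than anything stated in the source.

```python
import unicodedata

CATEGORIES = ("kanji", "hiragana", "katakana", "symbol", "digit", "latin")

def char_category(ch):
    """Classify one character into one of the six categories."""
    folded = unicodedata.normalize("NFKC", ch)  # fold full-width forms
    if len(folded) != 1:
        return "symbol"
    code = ord(folded)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if "0" <= folded <= "9":
        return "digit"
    if folded.isascii() and folded.isalpha():
        return "latin"
    return "symbol"

def composition(text):
    """Return the share of each category among non-space characters."""
    counts = dict.fromkeys(CATEGORIES, 0)
    for ch in text:
        if not ch.isspace():
            counts[char_category(ch)] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

def composition_similarity(text_a, text_b):
    """Assumed measure: 1.0 for identical ratios, 0.0 for disjoint ones."""
    pa, pb = composition(text_a), composition(text_b)
    return 1.0 - 0.5 * sum(abs(pa[k] - pb[k]) for k in CATEGORIES)
```

Two areas from the same article would be expected to show similar ratios, for example a similar share of kanji, whereas a table of figures and a block of body text would differ sharply.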

[0032] To solve the above problems, the similarity between the character areas is preferably judged on the basis of the format within the character areas.

[0033] To solve the above problems, the format is preferably the font.

[0034] To solve the above problems, the format is preferably the character size.

[0035] To solve the above problems, the format is preferably the line length.

[0036] To solve the above problems, the format is preferably the character pitch.

[0037] To solve the above problems, the format is preferably the line pitch.

[0038] To solve the above problems, the format is preferably the slant of the characters.

[0039] To solve the above problems, the format is preferably the slant of the lines.

[0040] To solve the above problems, the format is preferably the gap between characters.

[0041] To solve the above problems, the format is preferably the gap between lines.

[0042] To solve the above problems, the format is preferably the typesetting direction (vertical or horizontal).

【００４３】[0043]

[Embodiment] FIG. 18 is a block diagram showing the configuration of the apparatus in this embodiment. Reference numeral 1001 denotes a CPU that executes the processing of the entire apparatus and controls judgment and processing according to a control program stored in the ROM 1002. Reference numeral 1002 denotes a ROM, which stores the control programs for the flowcharts described in this embodiment and data such as predetermined parameters used in the processing. Reference numeral 1003 denotes a RAM, which provides a working memory area for data being processed by the CPU 1001. Reference numeral 1004 denotes a scanner that optically reads a document image; the image data read can be stored in the RAM 1003. Reference numeral 1005 denotes a keyboard, used to enter various codes and operator instructions. Reference numeral 1006 denotes a pointing device, which can indicate a desired position on the display screen of the display 1007 and can also enter selection and cancellation instructions when its button is clicked. Reference numeral 1007 denotes a display consisting of a CRT or a liquid crystal display. Reference numeral 1008 denotes a printer such as an LBP (laser beam printer) or an inkjet printer, 1009 denotes an external storage device such as an FD (floppy disk), and 1010 denotes a data bus for exchanging data among these components.

[0044] (Embodiment 1) FIG. 1 is a flowchart showing the ordering process of an embodiment of the present invention.

[0045] FIG. 2 shows an example of a document image input from the scanner 1004 or from a memory such as the FD 1009; this image data is stored in the RAM 1003.

[0046] Character areas, in which character strings or text are clustered to some extent, are extracted from the input document image by, for example, taking histograms over the whole image in the vertical and horizontal directions and analyzing the result. The position information of each extracted area is stored in the RAM 1003. When the image of each area is analyzed later, the image information identified by this position information is retrieved from the RAM 1003. Reference numerals 21, 22, 23, and 24 denote the extracted character areas.
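The histogram analysis described above is not detailed in the source. A minimal sketch of one common way to do it, projecting a binary image onto each axis and cutting at blank gaps, is given below. The function names, the `min_gap` parameter, and the single-level row-then-column cut are assumptions of this example; the image is taken to be a list of rows of 0/1 pixels with 1 meaning ink.

```python
def projection_profiles(image):
    """Return (row sums, column sums) of a binary image."""
    rows = [sum(row) for row in image]
    cols = [sum(col) for col in zip(*image)]
    return rows, cols

def nonblank_runs(profile, min_gap=1):
    """Return (start, end) pairs, end exclusive, of non-blank stretches.

    Blank stretches shorter than `min_gap` are bridged, so that small
    gaps (for example between lines) do not split an area.
    """
    runs, start, gap = [], None, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs

def extract_regions(image, min_gap=1):
    """Cut into horizontal bands, then cut each band into columns.

    Returns boxes as (left, top, right, bottom), right/bottom exclusive.
    """
    regions = []
    row_profile, _ = projection_profiles(image)
    for top, bottom in nonblank_runs(row_profile, min_gap):
        _, col_profile = projection_profiles(image[top:bottom])
        for left, right in nonblank_runs(col_profile, min_gap):
            regions.append((left, top, right, bottom))
    return regions
```

The resulting boxes play the role of the position information stored in the RAM 1003; distinguishing character areas from photograph areas such as 25 would need a further classification step that this sketch does not attempt.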

[0047] Reference numeral 25 denotes a photograph area.

[0048] Each processing step in FIG. 1 is described below.

[0049] Step S101: Two character areas A and B whose continuity is to be checked are taken out (the continuity from A to B is checked); A is the basic character area and B is the comparison character area.

[0050] In the example of FIG. 2, position alone cannot determine whether the character area that follows character area 21 is character area 22, immediately to the left of basic character area 21, or character area 23, below it. The textual continuity between character areas 21 and 22 and between character areas 21 and 23 therefore has to be checked. Here, the continuity between character areas 21 and 22 is checked first, so these two areas are taken out, with character area 21 as the basic character area and character area 22 as the comparison character area. Basic character areas are chosen from the plural character areas extracted from the input image in order of proximity to a start point: the upper right when the text in the character areas is written vertically, and the upper left when it is written horizontally. Comparison character areas are the vertically written character areas below and to the left of the basic character area when the basic character area is written vertically, and the horizontally written character areas below and to the right when it is written horizontally.
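The selection rules for basic and comparison character areas might be sketched as below. The `Region` type, the use of Euclidean distance from the page corner, and the overlap tests that define "below" and "beside" are assumptions of this example; the source only states the directions. Coordinates are image coordinates with y growing downward.

```python
from dataclasses import dataclass
from math import hypot

@dataclass(frozen=True)
class Region:
    name: str
    left: int
    top: int
    right: int
    bottom: int
    vertical: bool = True  # True for vertically written text

def order_basic_areas(regions, page_width, vertical=True):
    """Sort areas by distance from the start corner of the page
    (upper right for vertical writing, upper left for horizontal)."""
    def distance(r):
        corner_x = r.right if vertical else r.left
        start_x = page_width if vertical else 0
        return hypot(corner_x - start_x, r.top)
    return sorted(regions, key=distance)

def comparison_candidates(basic, regions):
    """Return same-direction areas below the basic area, or beside it on
    the left (vertical writing) or on the right (horizontal writing)."""
    found = []
    for r in regions:
        if r == basic or r.vertical != basic.vertical:
            continue
        overlap_x = r.left < basic.right and r.right > basic.left
        overlap_y = r.top < basic.bottom and r.bottom > basic.top
        below = r.top >= basic.bottom and overlap_x
        if basic.vertical:
            beside = r.right <= basic.left and overlap_y
        else:
            beside = r.left >= basic.right and overlap_y
        if below or beside:
            found.append(r)
    return found
```

With four areas laid out in a two-by-two grid like FIG. 2 (the coordinates being made up for the example), area 21 in the upper right comes first, and its candidates are area 22 to its left and area 23 below it.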

[0051] Step S102: Next, the continuity C from the basic character area to the comparison character area is obtained. Various methods of obtaining the continuity C are described in detail later.

[0052] Step S103: The continuity C is compared with a threshold α.

[0053] C ≥ α. When this expression is satisfied, the process proceeds to step S104; otherwise it proceeds to step S105. Here the threshold α is 1.0.

[0054] In the example of FIG. 3, the continuity is C = 1.0, which satisfies the expression, so the process proceeds to step S104.

[0055] Step S104: The areas are determined to be "continuous".

[0056] In the example of FIG. 3, the areas are determined to be "continuous".

[0057] Step S105: The areas are determined to be "not continuous".

[0058] Continuity is determined in the same way when the basic character area is character area 21 and the comparison character area is character area 23.

[0059] In this way, continuity is judged for every candidate for the comparison character area B. If exactly one character area is judged "continuous", that area is decided to be the area that follows the basic character area. If several character areas are judged "continuous", the one with the greatest continuity C is decided to be the following area. If no area is judged "continuous", the group of continuous areas is judged to be complete at that basic character area.
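The threshold test of steps S103 to S105 and the selection rule of paragraph [0059] can be sketched together as follows. The threshold value 1.0 comes from the text; the function names and the dictionary of scores are illustrative assumptions.

```python
ALPHA = 1.0  # threshold α given in the text

def is_continuous(c, alpha=ALPHA):
    """Steps S103 to S105: 'continuous' when C >= α."""
    return c >= alpha

def choose_successor(scores, alpha=ALPHA):
    """Pick the area that follows the basic area.

    `scores` maps each candidate comparison area to its continuity C.
    Returns the passing candidate with the greatest C, or None when no
    candidate passes, meaning the group ends at the basic area.
    """
    passing = {area: c for area, c in scores.items() if is_continuous(c, alpha)}
    if not passing:
        return None
    return max(passing, key=passing.get)
```

How ties between equally scored candidates should be broken is not stated in the source; this sketch simply keeps whichever candidate `max` encounters first.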

[0060] This completes the ordering process.

[0061] By repeating this continuity judgment for each character area while updating the basic character area A, continuity is judged for all of the character areas extracted from the input document image (or for all of the character areas specified as processing targets). The recognition results of the characters contained in the character areas are then concatenated in the determined order and can be displayed as text as the recognition result of the input document image.
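One possible reading of the repetition described in paragraph [0061] is a loop that follows successors until a group ends and then restarts from the next unvisited area. The sketch below takes that reading; the function name, the callback parameters `candidates_of` and `continuity`, and the grouping of the result are assumptions, since the source does not spell out the loop.

```python
def reading_order(areas, candidates_of, continuity, alpha=1.0):
    """Chain character areas into groups of continuous text.

    `areas` is assumed to be listed in basic-area priority order.
    `candidates_of(a)` returns the comparison candidates of area a, and
    `continuity(a, b)` returns the continuity C from a to b.
    """
    unvisited = list(areas)
    groups = []
    while unvisited:
        current = unvisited.pop(0)
        group = [current]
        while True:
            scored = [(continuity(current, b), b)
                      for b in candidates_of(current) if b in unvisited]
            scored = [(c, b) for c, b in scored if c >= alpha]
            if not scored:
                break  # the group is complete at this basic area
            _, successor = max(scored, key=lambda pair: pair[0])
            unvisited.remove(successor)
            group.append(successor)
            current = successor
        groups.append(group)
    return groups
```

Concatenating the recognized text of each group in order then yields the output text; for the ambiguous layout of FIG. 2 the result depends entirely on the continuity scores rather than on position.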

[0062] Various examples of how to obtain the continuity C used in step S103 are described in detail below, taking as an example the case where the basic character area A is area 21 and the comparison character area B is area 22.

[0063] FIG. 3 shows an example in which the last character of character area 21 of the document image shown in FIG. 2 is an arrow. In the figure, 31 denotes the arrow.

[0064] Reference numerals 32 and 33 denote orthogonal coordinate axes whose origin is placed at the arrow.

[0065] Reference numeral 34 denotes the point at the upper right corner of character area 22.

[0066] Reference numeral 35 denotes the region to the lower right of the arrow, that is, the fourth quadrant of the orthogonal coordinate system formed by coordinate axes 32 and 33.

[0067] FIG. 4 is a detailed flowchart of a first example of step S102.

[0068] Step S102 is described with reference to the flowchart of FIG. 4.

[0069] First, in step S401, the continuity C is initialized to 0.0.

[0070] Next, in step S402, it is determined whether the last character of the basic character area is an arrow. If it is, the process proceeds to step S403. If it is not, step S102 ends; that is, the continuity remains 0.0.

[0071] In the example of FIG. 3, the basic character area is character area 21, and it is determined whether its last character is an arrow. As shown by arrow 31 in FIG. 3, the last character is an arrow pointing to the lower right, so the process proceeds to step S403.

[0072] Next, in step S403, it is determined whether the comparison character area lies in the direction of the arrow. If it does, the process proceeds to step S404. If it does not, step S102 ends; that is, the continuity remains 0.0.

[0073] In the example of FIG. 3, arrow 31 points to the lower right, and point 34 at the upper right corner of character area 22 (for horizontal writing, the upper left corner of the character area is used) lies in the fourth quadrant of the orthogonal coordinate system formed by coordinate axes 32 and 33 with its origin at the arrow. It is therefore determined that comparison character area 22 lies in the direction of the arrow, and the process proceeds to step S404.

[0074] In step S404, 1.0 is added to the continuity C to increase it.

[0075] In the example of FIG. 3, 1.0 is added to 0.0, so the continuity C becomes 1.0.

[0076] This completes step S102.
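Steps S401 to S404 can be sketched as below. The table of arrow characters and the rule for axis-aligned arrows (only the non-zero component of the direction has to match) are assumptions of this example; the source describes only the diagonal case, where the reference corner of the comparison area must fall in the quadrant the arrow points to. Coordinates are image coordinates with y growing downward.

```python
# Hypothetical mapping from arrow characters to (dx, dy) direction signs.
ARROWS = {
    "→": (1, 0), "←": (-1, 0), "↑": (0, -1), "↓": (0, 1),
    "↘": (1, 1), "↙": (-1, 1), "↗": (1, -1), "↖": (-1, -1),
}

def sign(value):
    return (value > 0) - (value < 0)

def arrow_continuity(last_char, arrow_pos, target_corner):
    """Return the continuity C for the arrow-based method.

    `arrow_pos` is the (x, y) position of the arrow; `target_corner` is
    the reference corner of the comparison area (upper right for
    vertical writing, upper left for horizontal writing).
    """
    c = 0.0                                   # S401: initialize
    direction = ARROWS.get(last_char)
    if direction is None:                     # S402: not an arrow
        return c
    offset = (target_corner[0] - arrow_pos[0],
              target_corner[1] - arrow_pos[1])
    for wanted, actual in zip(direction, offset):   # S403: direction test
        if wanted != 0 and sign(actual) != wanted:
            return c
    return c + 1.0                            # S404: increase continuity
```

For an arrow pointing to the lower right, the function returns 1.0 only when the corner lies to the right of and below the arrow, which corresponds to the fourth quadrant 35 in FIG. 3.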

[0077] In the example shown in FIG. 3, the continuity between character areas is increased on the basis of an arrow at the end of a character area. Alternatively, when a Japanese full stop (kuten) or a period is present at the end of a character area, the continuity for a character area that begins with an indentation may be increased.

[0078] How the continuity C is obtained in this case is described in detail below.

[0079] FIG. 5 shows the first or last characters of character areas 21, 22, and 23 of the document image shown in FIG. 2. In the figure, 51 denotes the last character of character area 21, which is a full stop.

[0080] Reference numeral 52 denotes the indentation at the beginning of character area 22.

[0081] Reference numeral 53 denotes the first character of character area 23, "速".

[0082] FIG. 9 is a detailed flowchart of a second example of step S102.

[0083] Step S102 is described with reference to the flowchart of FIG. 9.

[0084] First, in step S901, the continuity C is initialized to 0.0.

[0085] Next, in step S902, it is determined whether the last character of the basic character area is a full stop or a period. If it is, the process proceeds to step S903. If it is neither, step S102 ends.

[0086] In the example of FIG. 5, consider the case where the basic character area is character area 21 and the comparison character area is character area 22. Basic character area 21 ends with full stop 51, so the process proceeds to step S903.

[0087] In step S903, it is determined whether the beginning of the comparison character area is indented. If it is, the process proceeds to step S904. If it is not, step S102 ends.

[0088] In the example of FIG. 5, comparison character area 22 begins with indentation 52, so the process proceeds to step S904.

[0089] In step S904, 1.0 is added to the continuity C to increase it.

[0090] In the example of FIG. 5, 1.0 is added to 0.0, so the continuity C becomes 1.0.

[0091] The continuity is then compared with the threshold in step S103, and the process proceeds to step S104, where basic character area 21 and comparison character area 22 are determined to be continuous.

[0092] The case where the basic character area is character area 21 and the comparison character area is character area 23 is handled in the same way. Here, comparison character area 23 begins with character 53 and is not indented, so the determination in step S903 is NO, step S102 ends, and the process proceeds to step S103.

[0093] Since the continuity C remains 0.0, the process proceeds to step S105, and basic character area 21 and comparison character area 23 are determined not to be continuous.
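Steps S901 to S904 might look like the following when applied to recognized text. Treating a leading full-width or ASCII space as the sign of indentation is an assumption of this example; an actual system could equally detect indentation from the layout of the first line. The set of sentence-end symbols is likewise a guess at what "full stop or period" covers.

```python
SENTENCE_END = ("。", "．", ".")   # kuten, full-width period, ASCII period
INDENT = ("　", " ")               # full-width space, ASCII space

def indent_continuity(basic_text, comparison_text):
    """Return the continuity C for the full-stop-and-indentation method."""
    c = 0.0                                          # S901: initialize
    if not basic_text.rstrip().endswith(SENTENCE_END):   # S902
        return c
    if not comparison_text.startswith(INDENT):       # S903
        return c
    return c + 1.0                                   # S904
```

This mirrors the FIG. 5 walkthrough: an area ending in a full stop followed by an indented area scores 1.0, while an area starting directly with a character such as "速" scores 0.0.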

[0094] In the example shown in FIG. 3, the continuity between character areas is increased on the basis of an arrow at the end of a character area. Alternatively, when a character area does not end with a full stop or a period, the continuity between character areas may be obtained using the likelihood that the last sentence of that character area and the first sentence of the other character area being compared form a single sentence.

[0095] The example described here increases this likelihood when the basic character area ends with a noun and the first sentence of the other character area being compared begins with a particle.

[0096] How the continuity C is obtained is described in detail below.

[0097] FIG. 6 shows the first or last characters of character areas 21, 22, and 23 of the document image shown in FIG. 2. In the figure, 61 denotes the last characters of character area 21, which are the noun "ロシア" (Russia). Reference numeral 62 denotes the first character of character area 22, which is the particle "が". Reference numeral 63 denotes the first character of character area 23.

[0098] FIG. 10 is a detailed flowchart of this example of step S102.

[0099] Step S102 is described with reference to the flowchart of FIG. 10.

[0100] First, in step S1001, the continuity C is initialized to 0.0.

[0101] Next, in step S1002, it is determined whether the basic character area ends with a full stop or a period. If it does, step S102 ends. If it does not, the process proceeds to step S1003.

[0102] In the example of FIG. 6, the basic character area is character area 21, which ends with neither a full stop nor a period, so the process proceeds to step S1003.

【０１０３】ステップＳ１００３では、基本文字領域の
最後が名詞で終了しているか判定する。最後が名詞であ
れば、ステップＳ１００４に進む。名詞でなければ、ス
テップＳ１０２を終了する。In step S1003, it is determined whether the end of the basic character area ends with a noun. If the last is a noun, the process proceeds to step S1004. If it is not a noun, step S102 ends.

【０１０４】図６の例では、基本文字領域２１の最後が
名詞６１の「ロシア」なので、ステップＳ１００４に進
む。In the example of FIG. 6, since the last of the basic character area 21 is the noun 61 “Russia”, the process proceeds to step S1004.

【０１０５】ステップＳ１００４では、比較文字領域の
最初が助詞で始まるか判定する。助詞で始まっていれ
ば、ステップＳ１００５に進む。助詞で始まっていなけ
れば、ステップＳ１０２を終了する。In step S1004, it is determined whether the beginning of the comparison character area starts with a particle. If it starts with a particle, the process proceeds to step S1005. If it does not start with a particle, step S102 ends.

【０１０６】図６の例では、比較文字領域が文字領域２
２の場合は、初めが助詞６２の「が」で始まっているの
で、ステップＳ１００５に進む。In the example of FIG. 6, when the comparison character area is the character area 22, it begins with the particle 62, "ga" (が), so the process proceeds to step S1005.

【０１０７】ステップＳ１００５では、連続性Ｃに１．
０を加えて、連続性を大きくする。In step S1005, 1.0 is added to the continuity C to increase the continuity.

【０１０８】図６の例では、連続性Ｃは０．０に１．０
を加えて１．０となる。ステップＳ１０３の判定の結果
ステップＳ１０４に進み、文字領域２１と２２は連続す
ると判定する。In the example of FIG. 6, the continuity C becomes 1.0 by adding 1.0 to 0.0. As a result of the determination in step S103, the process proceeds to step S104, and it is determined that the character areas 21 and 22 are continuous.

【０１０９】同様に比較文字領域が、文字領域２３の場
合は初めが文字６３の「目」で始まっており、助詞では
ないのでステップＳ１０２を終了し、連続性Ｃは０．０
となり、文字領域２１と文字領域２３は連続しないと判
定する。Similarly, when the comparison character area is the character area 23, it begins with the character 63, "目", which is not a particle, so step S102 ends with the continuity C remaining 0.0, and it is determined that the character area 21 and the character area 23 are not continuous.
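The noun and particle check of steps S1001 to S1005 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: it assumes that a morphological analyzer has already turned each character area into (surface, part-of-speech) pairs, and the names `Token`, `noun_particle_continuity`, and the tag strings "noun" and "particle" are hypothetical.

```python
from typing import List, Tuple

# Hypothetical token representation: (surface form, part-of-speech tag).
Token = Tuple[str, str]

# Kuten and period, the two sentence terminators tested in step S1002.
SENTENCE_END = {"。", "."}


def noun_particle_continuity(base: List[Token], compared: List[Token]) -> float:
    """Return the continuity C computed along steps S1001 to S1005."""
    c = 0.0  # S1001: initialize C
    if not base or not compared:
        return c
    last_surface, last_pos = base[-1]
    if last_surface in SENTENCE_END:  # S1002: the basic area ends a sentence
        return c
    if last_pos != "noun":  # S1003: the basic area must end with a noun
        return c
    if compared[0][1] != "particle":  # S1004: the compared area must start with a particle
        return c
    return c + 1.0  # S1005: increase the continuity


# Only the boundary tokens of FIG. 6 are modeled here.
area21 = [("ロシア", "noun")]
area22 = [("が", "particle")]
area23 = [("目", "noun")]

print(noun_particle_continuity(area21, area22))
print(noun_particle_continuity(area21, area23))
```

With these inputs the pair (21, 22) receives a continuity of 1.0 and the pair (21, 23) receives 0.0, matching the worked example of FIG. 6.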

【０１１０】図３のフローチャートに示す例では、文字
領域間の連続性を文字領域最後の矢印によって大きくし
たが、文字領域の最後が句点またはピリオドで終了して
いない場合に、最後の文と比較する他の文字領域の最初
の文の、一つの文としての確からしさを用いて、文字領
域間の連続性を求めても良い。In the example shown in the flowchart of FIG. 3, the continuity between character areas is increased by an arrow at the end of a character area. Alternatively, when a character area does not end with a kuten (the Japanese full stop) or a period, the continuity between character areas may be obtained from the plausibility, as a single sentence, of its last sentence joined with the first sentence of the other character area being compared.

【０１１１】ここで、文としての確からしさは、文字領
域の最後の文が主語を含むが述語を含まない場合に、比
較する他の文字領域の最初の文が主語を含まずかつ述語
を含む時に文としての確からしさを大きくする例につい
て説明する。Here, an example will be described in which the plausibility as a sentence is increased when the last sentence of a character area contains a subject but no predicate and the first sentence of the other character area being compared contains a predicate but no subject.

【０１１２】以下、連続性Ｃの求め方について詳細に説
明する。The method of obtaining the continuity C will be described in detail below.

【０１１３】図７は、図２に示す原稿画像の文字領域２
１、２２、２３に関して最初や最後の文字を示した図で
ある。図において、７１は、文字領域２１の最後の文
「関連法案の整備が」である。７２は、文字領域２２の
最初の文「遅れた。」である。７３は、文字領域２３の
最初の文「最初の国はロシアとなる。」である。FIG. 7 is a diagram showing the first and last characters of the character areas 21, 22, and 23 of the original image shown in FIG. 2. In the figure, 71 is the last sentence of the character area 21, "the preparation of the related bills" (関連法案の整備が, ending with the subject marker "ga"). 72 is the first sentence of the character area 22, "was delayed." (遅れた。). 73 is the first sentence of the character area 23, "The first country will be Russia." (最初の国はロシアとなる。).

【０１１４】図１１は、ステップＳ１０２についての詳
細なフローチャートである。FIG. 11 is a detailed flowchart of step S102.

【０１１５】図１１のフローチャートに従って、ステッ
プＳ１０２を説明する。Step S102 will be described with reference to the flowchart of FIG.

【０１１６】まず、ステップＳ１１０１で連続性Ｃに
０．０を代入して初期化する。First, in step S1101, 0.0 is substituted for continuity C for initialization.

【０１１７】次にステップＳ１１０２で、基本文字領域
の最後が句点またはピリオドか判定する。句点またはピ
リオドの場合はステップＳ１０２を終了する。句点また
はピリオドでない場合は、ステップＳ１１０３に進む。Next, in step S1102, it is determined whether the basic character area ends with a kuten (the Japanese full stop) or a period. If it does, step S102 ends. If it does not, the process proceeds to step S1103.

【０１１８】図７の例では、基本文字領域を文字領域２
１とし、最後が句点でもピリオドでもないので、ステッ
プＳ１１０３に進む。In the example of FIG. 7, the basic character area is the character area 21; since it ends with neither a kuten (the Japanese full stop) nor a period, the process proceeds to step S1103.

【０１１９】ステップＳ１１０３では、基本文字領域の
最後の文が主語を含みかつ述語を含まないか判定する。
主語を含みかつ述語を含まない場合は、ステップＳ１１
０４に進む。そうでない場合はステップＳ１０２を終了
する。In step S1103, it is determined whether the last sentence of the basic character area contains a subject and contains no predicate. If it contains a subject and no predicate, the process proceeds to step S1104. Otherwise, step S102 ends.

【０１２０】図７の例では、基本文字領域２１の最後の
文７１が主語を含むが述語を含まないので、ステップＳ
１１０４に進む。In the example of FIG. 7, the last sentence 71 of the basic character area 21 contains a subject but no predicate, so the process proceeds to step S1104.

【０１２１】ステップＳ１１０４で、比較文字領域の最
初の文が主語を含まずかつ述語を含むか判定する。主語
を含まずかつ述語を含む場合は、ステップＳ１１０５に
進む。そうでない場合は、ステップＳ１０２を終了す
る。In step S1104, it is determined whether the first sentence of the comparison character area contains no subject and contains a predicate. If it contains no subject and contains a predicate, the process proceeds to step S1105. Otherwise, step S102 ends.

【０１２２】図７の例では、比較文字領域を文字領域２
２とした場合、比較文字領域２２の最初の文７２が主語
を含まず述語を含むので、ステップＳ１１０５に進む。In the example of FIG. 7, when the comparison character area is the character area 22, the first sentence 72 of the comparison character area 22 contains no subject and contains a predicate, so the process proceeds to step S1105.

【０１２３】ステップＳ１１０５で、連続性Ｃに１．０
を加えて連続性を大きくする。In step S1105, 1.0 is added to the continuity C to increase the continuity.

【０１２４】図７の例では、連続性Ｃは０．０に１．０
を加えて１．０となる。ステップＳ１０３の判定の結果
ステップＳ１０４に進み、文字領域２１と２２は連続す
ると判定する。In the example of FIG. 7, the continuity C becomes 1.0 by adding 1.0 to 0.0. As a result of the determination in step S103, the process proceeds to step S104, and it is determined that the character areas 21 and 22 are continuous.

【０１２５】同様に比較文字領域を文字領域２３にした
場合は、ステップＳ１１０４において最初の文７３が、
主語を含んでいるためステップＳ１０２を終了する。連
続性Ｃは、０．０となりステップＳ１０３の判定の結果
ステップＳ１０５に進み、文字領域２１と文字領域２３
と連続しないと判定する。Similarly, when the comparison character area is the character area 23, the first sentence 73 contains a subject, so step S102 ends at step S1104. The continuity C remains 0.0; as a result of the determination in step S103, the process proceeds to step S105, and it is determined that the character area 21 and the character area 23 are not continuous.
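The subject and predicate test of steps S1101 to S1105 can be sketched as follows. This is a hedged illustration only: the source does not say how subjects and predicates are detected, so the sketch takes the result of that analysis as input. The `SentenceInfo` class, its field names, and `subject_predicate_continuity` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class SentenceInfo:
    """Hypothetical result of a syntactic analysis of one sentence fragment."""
    has_subject: bool
    has_predicate: bool
    ends_with_terminator: bool = False  # ends with a kuten or a period


def subject_predicate_continuity(last_of_base: SentenceInfo,
                                 first_of_compared: SentenceInfo) -> float:
    """Return the continuity C computed along steps S1101 to S1105."""
    c = 0.0  # S1101: initialize C
    if last_of_base.ends_with_terminator:  # S1102
        return c
    # S1103: the last sentence of the basic area needs a subject and no predicate.
    if not (last_of_base.has_subject and not last_of_base.has_predicate):
        return c
    # S1104: the first sentence of the compared area needs a predicate and no subject.
    if first_of_compared.has_subject or not first_of_compared.has_predicate:
        return c
    return c + 1.0  # S1105: increase the continuity


# Sentences 71, 72, and 73 of FIG. 7, described by hand.
s71 = SentenceInfo(has_subject=True, has_predicate=False)
s72 = SentenceInfo(has_subject=False, has_predicate=True, ends_with_terminator=True)
s73 = SentenceInfo(has_subject=True, has_predicate=True, ends_with_terminator=True)

print(subject_predicate_continuity(s71, s72))
print(subject_predicate_continuity(s71, s73))
```

Keeping the linguistic analysis outside the function leaves the flowchart logic easy to check against FIG. 11, whatever parser supplies the flags.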

【０１２６】図３のフローチャートに示す例では、文字
領域間の連続性を文字領域最後の矢印によって大きくし
たが、文字領域の最後が句点またはピリオドで終了して
いない場合に、最後の文と比較する他の文字領域の最初
の文の、一つの文としての確からしさを用いて、文字領
域間の連続性を求めても良い。In the example shown in the flowchart of FIG. 3, the continuity between character areas is increased by an arrow at the end of a character area. Alternatively, when a character area does not end with a kuten (the Japanese full stop) or a period, the continuity between character areas may be obtained from the plausibility, as a single sentence, of its last sentence joined with the first sentence of the other character area being compared.

【０１２７】ここで、文としての確からしさは、文字領
域の最後の文が主語を含むが述語を含まない場合に、比
較する他の文字領域の最初の文が主語を含まずかつ述語
を含む時に文としての確からしさを大きくする際に、主
語と述語の関連度を用いて文としての確からしさを求め
る例について説明する。Here, an example will be described in which, when the last sentence of a character area contains a subject but no predicate and the first sentence of the other character area being compared contains a predicate but no subject, the plausibility as a sentence is obtained from the degree of association between the subject and the predicate.

【０１２８】以下、連続性Ｃの求め方について詳細に説
明する。The method of obtaining the continuity C will be described in detail below.

【０１２９】図８は、図２に示す原稿画像の文字領域２
１、２２、２３に関して最初や最後の文字を示した図で
ある。図において、８１は、文字領域２１の最後の文
「関連法案の整備が」である。８２は、文字領域２２の
最初の文「遅れる。」である。８３は、文字領域２３の
最初の文「走る。」である。FIG. 8 is a diagram showing the first and last characters of the character areas 21, 22, and 23 of the original image shown in FIG. 2. In the figure, 81 is the last sentence of the character area 21, "the preparation of the related bills" (関連法案の整備が). 82 is the first sentence of the character area 22, "is delayed." (遅れる。). 83 is the first sentence of the character area 23, "runs." (走る。).

【０１３０】図１２は、ステップＳ１０２についての詳
細なフローチャートである。FIG. 12 is a detailed flowchart of step S102.

【０１３１】図１３は、主語と述語の関連度データの一
部である。FIG. 13 is a part of the relevance data of the subject and the predicate.

【０１３２】図１２のフローチャートに従って、ステッ
プＳ１０２を説明する。Step S102 will be described with reference to the flowchart of FIG.

【０１３３】まず、ステップＳ１２０１で連続性Ｃに
０．０を代入して初期化する。First, in step S1201, 0.0 is substituted for continuity C for initialization.

【０１３４】次にステップＳ１２０２で、基本文字領域
の最後が句点またはピリオドか判定する。句点またはピ
リオドの場合はステップＳ１０２を終了する。句点また
はピリオドでない場合は、ステップＳ１２０３に進む。Next, in step S1202, it is determined whether the basic character area ends with a kuten (the Japanese full stop) or a period. If it does, step S102 ends. If it does not, the process proceeds to step S1203.

【０１３５】図８の例では、基本文字領域を文字領域２
１とし、最後が句点でもピリオドでもないので、ステッ
プＳ１２０３に進む。In the example of FIG. 8, the basic character area is the character area 21; since it ends with neither a kuten (the Japanese full stop) nor a period, the process proceeds to step S1203.

【０１３６】ステップＳ１２０３では、基本文字領域の
最後の文が主語を含みかつ述語を含まないか判定する。
主語を含みかつ述語を含まない場合は、ステップＳ１２
０４に進む。そうでない場合はステップＳ１０２を終了
する。In step S1203, it is determined whether the last sentence of the basic character area contains a subject and contains no predicate. If it contains a subject and no predicate, the process proceeds to step S1204. Otherwise, step S102 ends.

【０１３７】図８の例では、基本文字領域２１の最後の
文８１が主語を含むが述語を含まないので、ステップＳ
１２０４に進む。In the example of FIG. 8, the last sentence 81 of the basic character area 21 contains a subject but no predicate, so the process proceeds to step S1204.

【０１３８】ステップＳ１２０４で、比較文字領域の最
初の文が主語を含まずかつ述語を含むか判定する。主語
を含まずかつ述語を含む場合は、ステップＳ１２０５に
進む。そうでない場合は、ステップＳ１０２を終了す
る。In step S1204, it is determined whether the first sentence of the comparison character area contains no subject and contains a predicate. If it contains no subject and contains a predicate, the process proceeds to step S1205. Otherwise, step S102 ends.

【０１３９】図８の例では、比較文字領域を文字領域２
２とした場合、比較文字領域２２の最初の文８２が主語
を含まず述語を含むので、ステップＳ１２０５に進む。In the example of FIG. 8, when the comparison character area is the character area 22, the first sentence 82 of the comparison character area 22 contains no subject and contains a predicate, so the process proceeds to step S1205.

【０１４０】ステップＳ１２０５で、連続性Ｃに主語と
述語の関連度を加えて連続性を大きくする。In step S1205, the continuity is increased by adding the degree of association between the subject and the predicate to the continuity C.

【０１４１】図８の例では、主語が「整備が」で、述語
が「遅れる」であるので図１３に示す関連度データか
ら、関連度は１．２５であることが分かる。そこで、連
続性Ｃは０．０に１．２５を加えて１．２５となる。ス
テップＳ１０３の判定の結果ステップＳ１０４に進み、
文字領域２１と２２は連続すると判定する。In the example of FIG. 8, the subject is "preparation" (整備が) and the predicate is "is delayed" (遅れる), so the association data shown in FIG. 13 gives a degree of association of 1.25. The continuity C therefore becomes 1.25 by adding 1.25 to 0.0. As a result of the determination in step S103, the process proceeds to step S104, and it is determined that the character areas 21 and 22 are continuous.

【０１４２】同様に比較文字領域を文字領域２３にした
場合は、主語が「整備が」で、最初の文８３の述語が
「走る」であるので、図１３に示す関連度データにそれ
らの関連度が載っていない。載っていない場合は、関連
度０．０なので、連続性Ｃは０．０を加えて０．０のま
まである。ステップＳ１０３の判定の結果ステップＳ１
０５に進み、文字領域２１と文字領域２３は連続しない
と判定する。Similarly, when the comparison character area is the character area 23, the subject is "preparation" (整備が) and the predicate of the first sentence 83 is "runs" (走る), and the association data shown in FIG. 13 contains no entry for this pair. A pair that is not listed has a degree of association of 0.0, so 0.0 is added and the continuity C remains 0.0. As a result of the determination in step S103, the process proceeds to step S105, and it is determined that the character area 21 and the character area 23 are not continuous.
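The lookup performed in step S1205 can be sketched as a table keyed by (subject, predicate) pairs. This is an illustrative assumption about the data layout: the source shows only part of the association data in FIG. 13, and only the pair used in the worked example is reproduced here. The names `RELEVANCE` and `relevance_continuity` are hypothetical.

```python
from typing import Dict, Tuple

# The single pair stated in the worked example; FIG. 13 holds further entries
# that are not reproduced here.
RELEVANCE: Dict[Tuple[str, str], float] = {
    ("整備が", "遅れる"): 1.25,
}


def relevance_continuity(subject: str, predicate: str,
                         table: Dict[Tuple[str, str], float] = RELEVANCE) -> float:
    """Continuity C after step S1205, assuming steps S1202 to S1204 have passed."""
    c = 0.0  # S1201: initialize C
    # S1205: add the degree of association; pairs missing from the table count as 0.0.
    c += table.get((subject, predicate), 0.0)
    return c


print(relevance_continuity("整備が", "遅れる"))
print(relevance_continuity("整備が", "走る"))
```

A dictionary with a default of 0.0 mirrors the rule in the text that an unlisted pair has a degree of association of 0.0.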

【０１４３】図３のフローチャートに示す例では、文字
領域間の連続性を文字領域最後の矢印によって大きくし
たが、文字領域間に共通して存在する単語または類義語
の存在を用いて、文字領域間の連続性を求めても良い。In the example shown in the flowchart of FIG. 3, the continuity between character areas is increased by an arrow at the end of a character area. Alternatively, the continuity between character areas may be obtained from the presence of words, or synonyms of words, that the character areas have in common.

【０１４４】以下、連続性Ｃの求め方について詳細に説
明する。The method of obtaining the continuity C will be described in detail below.

【０１４５】図１５は、図２に示す原稿画像の文字領域
２１、２２、２３に関して全ての文字を示した図であ
る。FIG. 15 is a diagram showing all characters in the character areas 21, 22, and 23 of the original image shown in FIG.

【０１４６】図１６は、ステップＳ１０２についての詳
細なフローチャートである。FIG. 16 is a detailed flowchart of step S102.

【０１４７】図１６のフローチャートに従って、ステッ
プＳ１０２を説明する。Step S102 will be described with reference to the flowchart of FIG.

【０１４８】まず、ステップＳ１６０１で連続性Ｃに
０．０を代入して初期化する。First, in step S1601, 0.0 is substituted for continuity C for initialization.

【０１４９】次にステップＳ１６０２で、比較文字領域
内で、基本文字領域の単語と同一または類義語の、比較
文字領域内の総単語数に対する割合を出してＣに加え、
連続性を大きくする。Next, in step S1602, the number of words in the comparison character area that are identical to, or synonyms of, words in the basic character area is divided by the total number of words in the comparison character area, and the resulting ratio is added to C to increase the continuity.

【０１５０】図１５の例では、比較文字領域を文字領域
２２とした時、比較文字領域内２２で、基本文字領域２
１の単語と同一または類義語を取り出し、数を数える
と、「経済」が３個、「改革」が１個、「ロシア」の類
義語として「旧ソ連諸国」が１個、存在する。「ロシ
ア」は基本文字領域内に２単語存在するので２個として
カウントすると、合計は３＋１＋２＝６となる。In the example of FIG. 15, when the comparison character area is the character area 22, extracting and counting the words in the comparison character area 22 that are identical to, or synonyms of, words in the basic character area 21 gives three occurrences of "economy" (経済), one of "reform" (改革), and one of "former Soviet countries" (旧ソ連諸国) as a synonym of "Russia" (ロシア). Since "Russia" occurs twice in the basic character area, the synonym is counted as two, and the total is 3 + 1 + 2 = 6.

【０１５１】比較文字領域２２の総単語数は２３なの
で、その割合は、６÷２３＝０．２６となる。Since the total number of words in the comparison character area 22 is 23, the ratio is 6 ÷ 23 = 0.26.

【０１５２】０．２６を連続性Ｃに加えて０．２６とな
る。0.26 is added to the continuity C to obtain 0.26.

【０１５３】以上でステップＳ１０２を終了する。Thus, step S102 is completed.

【０１５４】次にステップＳ１０３で、閾値αと連続性
Ｃを比較する。閾値α以上であれば、ステップＳ１０４
に進み、未満であれば、ステップＳ１０５に進む。ただ
し、ここで閾値は０．２０とする。Next, in step S103, the continuity C is compared with the threshold value α. If C is greater than or equal to α, the process proceeds to step S104; if C is less than α, the process proceeds to step S105. Here, the threshold value α is 0.20.

【０１５５】図１５の例では、連続性Ｃは０．２６なの
でステップＳ１０４に進む。In the example of FIG. 15, since the continuity C is 0.26, the process proceeds to step S104.

【０１５６】ステップＳ１０４で、基本文字領域と比較
文字領域は連続すると判定する。In step S104, it is determined that the basic character area and the comparison character area are continuous.

【０１５７】同様に、比較文字領域が文字領域２３の場
合について説明すると、基本文字領域２１の単語と同一
または類義語を取り出し、数を数えると、「経済」が２
個、「改革」が１個、存在する。従って合計は２＋１＝
３となる。Similarly, when the comparison character area is the character area 23, extracting and counting the words that are identical to, or synonyms of, words in the basic character area 21 gives two occurrences of "economy" (経済) and one of "reform" (改革). The total is therefore 2 + 1 = 3.

【０１５８】比較文字領域２３の総単位数は１９なの
で、その割合は、３÷１９＝０．１６となる。Since the total number of words in the comparison character area 23 is 19, the ratio is 3 ÷ 19 = 0.16.

【０１５９】０．１６を連続性Ｃに加えて０．１６とな
る。0.16 is added to the continuity C to give 0.16.

【０１６０】以上でステップＳ１０２を終了する。Thus, step S102 is completed.

【０１６１】次にステップＳ１０３で、閾値αと連続性
Ｃを比較する。連続性Ｃは０．１６なのでステップＳ１
０５に進む。Next, in step S103, the continuity C is compared with the threshold value α. Since the continuity C is 0.16, which is less than α, the process proceeds to step S105.

【０１６２】ステップＳ１０５で、基本文字領域と比較
文字領域は連続しないと判定する。In step S105, it is determined that the basic character area and the comparison character area are not continuous.
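The word-overlap ratio of step S1602 and the threshold test of step S103 can be sketched as follows. This is a simplified illustration, not the method of the embodiment: it counts each matching word of the comparison area once, whereas the worked example above weights a synonym by the number of occurrences of its source word in the basic area. The function names, the `synonyms` mapping, and the sample word lists are hypothetical.

```python
from typing import Dict, List, Optional, Set


def overlap_continuity(base_words: List[str], compared_words: List[str],
                       synonyms: Optional[Dict[str, Set[str]]] = None) -> float:
    """Ratio of compared-area words that match a basic-area word or its synonym."""
    synonyms = synonyms or {}
    vocabulary = set(base_words)
    for word in base_words:
        vocabulary.update(synonyms.get(word, set()))  # accept synonyms as matches
    if not compared_words:
        return 0.0
    hits = sum(1 for word in compared_words if word in vocabulary)
    return hits / len(compared_words)


def is_continuous(c: float, alpha: float = 0.20) -> bool:
    """Step S103: continuous when C is greater than or equal to the threshold."""
    return c >= alpha


# Invented sample data for illustration only.
base = ["経済", "改革", "ロシア"]
compared = ["経済", "旧ソ連諸国", "支援", "経済"]
c = overlap_continuity(base, compared, {"ロシア": {"旧ソ連諸国"}})
print(c, is_continuous(c))

# The ratios stated in the text: 6/23 passes the 0.20 threshold, 3/19 does not.
print(is_continuous(6 / 23), is_continuous(3 / 19))
```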

【０１６３】図３のフローチャートに示す例では、文字
領域間の連続性を文字領域最後の矢印によって大きくし
たが、文字領域間の類似度を用いて求めても良い。In the example shown in the flowchart of FIG. 3, the continuity between the character areas is increased by the arrow at the end of the character areas, but it may be obtained by using the similarity between the character areas.

【０１６４】ここで、文字領域間の類似度を文章表現の
類似度、特に丁寧さに関する類似度を用いて求める例に
ついて説明する。Here, an example will be described in which the similarity between the character areas is obtained by using the similarity of the sentence expression, particularly the similarity regarding politeness.

【０１６５】以下、連続性Ｃの求め方について詳細に説
明する。The method of obtaining the continuity C will be described in detail below.

【０１６６】図１７は、ステップＳ１０２についての詳
細なフローチャートである。FIG. 17 is a detailed flowchart of step S102.

【０１６７】図１７のフローチャートに従って、ステッ
プＳ１０２を説明する。Step S102 will be described with reference to the flowchart of FIG.

【０１６８】まず、ステップＳ１７０１で連続性Ｃに
０．０を代入して初期化する。First, in step S1701, 0.0 is substituted for continuity C for initialization.

【０１６９】次にステップＳ１７０２で、基本文字領域
と比較文字領域の文章表現の丁寧さに関する類似度を出
してＣに加え、連続性を大きくする。Next, in step S1702, the similarity regarding the politeness of the text representation of the basic character area and the comparison character area is calculated and added to C to increase continuity.

【０１７０】例えば、尊敬語、謙譲語、丁寧語の辞書を
持ち、基本文字領域内の文章に存在するそれらの割合Ｘ
と、比較文字領域内に存在するそれらの割合Ｙを求め
て、基本文字領域と比較文字領域間の類似度Ｓを以下の
式で求める。For example, dictionaries of honorific words (sonkeigo), humble words (kenjōgo), and polite words (teineigo) are held; the proportion X of such words in the text of the basic character area and the proportion Y of such words in the text of the comparison character area are obtained; and the similarity S between the basic character area and the comparison character area is obtained by the following equation.

【０１７１】Ｓ＝１．０−（ＸとＹの差）…（１）
次に類似度Ｓを連続性Ｃに加える。
S = 1.0 - |X - Y| ... (1)
Here, |X - Y| is the difference between X and Y. Next, the similarity S is added to the continuity C.

【０１７２】この求めた連続性Ｃが、閾値αよりも大き
ければ、基本文字領域と比較文字領域は連続すると判定
される。If the continuity C thus obtained is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous.

【０１７３】具体的に考えると基本文字領域内の尊敬
語、謙譲語、丁寧語の割合が０．２５、比較文字領域内
の尊敬語、謙譲語、丁寧語の割合が０．３とすると類似
度Ｓは、Ｓ＝１．０−（０．３−０．２５）＝０．９５である。As a concrete example, if the proportion of honorific, humble, and polite words is 0.25 in the basic character area and 0.3 in the comparison character area, the similarity S is S = 1.0 - (0.3 - 0.25) = 0.95.

【０１７４】類似度Ｓを連続性Ｃに加えて、連続性Ｃは
Ｃ＝０．９５となる。By adding the similarity S to the continuity C, the continuity C becomes C = 0.95.

【０１７５】以上で、ステップＳ１０２を終了し、ステ
ップＳ１０３に進む。With the above, step S102 is ended, and the process proceeds to step S103.

【０１７６】ステップＳ１０３で、閾値αと比較しα以
上であれば、ステップＳ１０４に進む。未満であれば、
ステップＳ１０５に進む。ただし、ここでは閾値αは
０．８とする。In step S103, the continuity C is compared with the threshold value α. If C is greater than or equal to α, the process proceeds to step S104; if C is less than α, the process proceeds to step S105. Here, the threshold value α is 0.8.

【０１７７】この例では、連続性Ｃが閾値以上なのでス
テップＳ１０４に進み、基本文字領域と比較文字領域は
連続すると判定する。In this example, since the continuity C is not less than the threshold value, the process proceeds to step S104, and it is determined that the basic character area and the comparison character area are continuous.
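Equation (1) and the worked politeness example can be checked with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the difference between X and Y is taken as an absolute value, which is an assumption consistent with the worked example subtracting the smaller proportion from the larger.

```python
def ratio_similarity(x: float, y: float) -> float:
    """Equation (1): S = 1.0 - |X - Y| for two proportions in the range 0 to 1."""
    return 1.0 - abs(x - y)


def is_continuous(c: float, alpha: float) -> bool:
    """Step S103: continuous when C is greater than or equal to the threshold."""
    return c >= alpha


c = 0.0                            # S1701: initialize C
c += ratio_similarity(0.25, 0.3)   # S1702: add the politeness similarity
print(round(c, 2), is_continuous(c, alpha=0.8))
```

The same function applies unchanged to any other pair of proportions, since equation (1) depends only on the two ratios.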

【０１７８】本実施例の順序付け方法を表すフローチャ
ートは図１に示す第一の実施例と同様である。The flowchart showing the ordering method of this embodiment is the same as that of the first embodiment shown in FIG.

【０１７９】以上の説明により、前述の第一の実施例と
同様の作用、効果が得られる。From the above description, the same operation and effect as those of the above-mentioned first embodiment can be obtained.

【０１８０】図１７のフローチャートに示す例では、文
字領域間の類似度は、文章表現のうちの丁寧さに関する
類似度を用いて類似度を出しているが、類似度は行末表
現に関する類似度を用いても良い。In the example shown in the flowchart of FIG. 17, the similarity between character areas is obtained from the similarity in politeness of the text; alternatively, the similarity in line-end expressions may be used.

【０１８１】例えば、基本文字領域と比較文字領域内の
文章から、行末表現が「ですます体」である割合をそれ
ぞれ出し、式（１）によって文字領域間の類似度を出し
ても良い。For example, the proportion of line-end expressions in the "desu/masu" (polite) style may be obtained for the text of each of the basic character area and the comparison character area, and the similarity between the character areas may be obtained by equation (1).

【０１８２】次に類似度Ｓを連続性Ｃに加える。Next, the similarity S is added to the continuity C.

【０１８３】この求めた連続性Ｃが、閾値αよりも大き
ければ、基本文字領域と比較文字領域は連続すると判定
される。If the obtained continuity C is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous.

【０１８４】具体的に考えると基本文字領域内の「です
ます体」の割合が０．５、比較文字領域内の「ですます
体」の割合が０．４とすると類似度Ｓは、Ｓ＝１．０−（０．５−０．４）＝０．９である。As a concrete example, if the proportion of the "desu/masu" style is 0.5 in the basic character area and 0.4 in the comparison character area, the similarity S is S = 1.0 - (0.5 - 0.4) = 0.9.

【０１８５】類似度Ｓを連続性Ｃに加えて、連続性Ｃは
Ｃ＝０．９となる。By adding the similarity S to the continuity C, the continuity C becomes C = 0.9.

【０１８６】これは、閾値αよりも大きいので基本文字
領域と比較文字領域は連続すると判定する。ただし、こ
こで閾値αは０．８とする。Since this is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous. However, the threshold value α is 0.8 here.

【０１８７】図１７のフローチャートに示す例では、文
字領域間の類似度は、文章表現のうちの丁寧さに関する
類似度を用いて類似度を出しているが、類似度は待遇表
現に関する類似度を用いても良い。In the example shown in the flowchart of FIG. 17, the similarity between character areas is obtained from the similarity in politeness of the text; alternatively, the similarity in attitudinal expressions (taigū hyōgen) may be used.

【０１８８】ここで、待遇表現とは話題の人物に体する
話し手の、尊敬・親愛・軽侮などの態度を表す言語表現
をさす。Here, attitudinal expressions (taigū hyōgen) are linguistic expressions that convey the speaker's attitude, such as respect, affection, or contempt, toward the person being talked about.

【０１８９】例えば、基本文字領域と比較文字領域内の
文章から、待遇表現の割合をそれぞれ出し、式（１）に
よって文字領域間の類似度を出しても良い。For example, the proportion of attitudinal expressions may be obtained for the text of each of the basic character area and the comparison character area, and the similarity between the character areas may be obtained by equation (1).

【０１９０】次に類似度Ｓを連続性Ｃに加える。Next, the similarity S is added to the continuity C.

【０１９１】この求めた連続性Ｃが、閾値αよりも大き
ければ、基本文字領域と比較文字領域は連続すると判定
される。If the obtained continuity C is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous.

【０１９２】具体的に考えると基本文字領域内の待遇表
現の割合が０．２、比較文字領域内の待遇表現の割合が
０．３とすると類似度Ｓは、Ｓ＝１．０−（０．３−０．２）＝０．９である。As a concrete example, if the proportion of attitudinal expressions is 0.2 in the basic character area and 0.3 in the comparison character area, the similarity S is S = 1.0 - (0.3 - 0.2) = 0.9.

【０１９３】類似度Ｓを連続性Ｃに加えて、連続性Ｃは
Ｃ＝０．９となる。By adding the similarity S to the continuity C, the continuity C becomes C = 0.9.

【０１９４】これは、閾値αよりも大きいので基本文字
領域と比較文字領域は連続すると判定する。ただし、こ
こで閾値αは０．８とする。Since this is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous. However, the threshold value α is 0.8 here.

【０１９５】本実施例の順序付け方法を表すフローチャ
ートは図１に示す第一の実施例と同様である。The flowchart showing the ordering method of this embodiment is the same as that of the first embodiment shown in FIG.

【０１９６】以上の説明により、前述の第一の実施例と
同様の作用、効果が得られる。From the above description, the same operation and effect as in the first embodiment described above can be obtained.

【０１９７】図１７のフローチャートに示す例では、文
字領域間の類似度は、文章表現のうちの丁寧さに関する
類似度を用いて類似度を出しているが、類似度は文字領
域内の漢字、ひらがな、カタカナ、記号、数字、英字等
のジャンル毎の構成割合を用いて求めても良い。In the example shown in the flowchart of FIG. 17, the similarity between character areas is obtained from the similarity in politeness of the text; alternatively, the similarity may be obtained from the composition ratio of each character category in the character area, such as kanji, hiragana, katakana, symbols, digits, and alphabetic characters.

【０１９８】例えば、基本文字領域と比較文字領域内の
文章から、総文字数に対する漢字の割合をそれぞれ出
し、式（１）によって文字領域間の類似度を出しても良
い。For example, the proportion of kanji in the total number of characters may be obtained for the text of each of the basic character area and the comparison character area, and the similarity between the character areas may be obtained by equation (1).

【０１９９】次に類似度Ｓを連続性Ｃに加える。Next, the similarity S is added to the continuity C.

【０２００】この求めた連続性Ｃが、閾値αよりも大き
ければ、基本文字領域と比較文字領域は連続すると判定
される。If the continuity C thus obtained is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous.

【０２０１】具体的に考えると基本文字領域内の漢字の
割合が０．４、比較文字領域内の漢字の割合が０．３と
すると類似度Ｓは、Ｓ＝１．０−（０．４−０．３）＝０．９である。As a concrete example, if the proportion of kanji is 0.4 in the basic character area and 0.3 in the comparison character area, the similarity S is S = 1.0 - (0.4 - 0.3) = 0.9.

【０２０２】類似度Ｓを連続性Ｃに加えて、連続性Ｃは
Ｃ＝０．９となる。By adding the similarity S to the continuity C, the continuity C becomes C = 0.9.

【０２０３】これは、閾値αよりも大きいので基本文字
領域と比較文字領域は連続すると判定する。ただし、こ
こで閾値αは０．８とする。Since this is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous. However, the threshold value α is 0.8 here.
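The kanji proportion used in this example can be computed directly from recognized text, as sketched below. The sketch assumes that "kanji" means a code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF) and that whitespace is excluded from the total; the source does not define either point, and the function names are hypothetical.

```python
def kanji_ratio(text: str) -> float:
    """Proportion of kanji among the non-whitespace characters of a text."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # Assumption: kanji are the code points of the CJK Unified Ideographs block.
    kanji = sum(1 for ch in chars if "\u4e00" <= ch <= "\u9fff")
    return kanji / len(chars)


def ratio_similarity(x: float, y: float) -> float:
    """Equation (1): S = 1.0 - |X - Y|."""
    return 1.0 - abs(x - y)


# Two kanji out of five characters.
print(kanji_ratio("漢字とかな"))

# The proportions of the worked example: 0.4 and 0.3 give S of about 0.9.
print(round(ratio_similarity(0.4, 0.3), 2))
```

Proportions for hiragana, katakana, or digits could be obtained the same way by changing the code point range, and several such ratios could then be combined.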

【０２０４】尚、上述のような総文字数に対する漢字の
割合のみを用いて類似度を出す例に替えて、他のジャン
ルでももちろん良い。また、幾つかのジャンルを組み合
わせても良い。Instead of the example above, in which the similarity is obtained from the proportion of kanji in the total number of characters alone, another character category may of course be used. Several categories may also be combined.

【０２０５】尚、上述のような文字領域間の類似度は、
文章表現のうち丁寧さに関する類似度を用いて類似度を
出す例に替えて、類似度は文字領域内の文字画像から求
めた、文字大きさ、行長さ、文字ピッチ、行ピッチ、文
字間の隙間、行間の隙間を用いて求めても良い。Instead of the example in which the similarity between character areas is obtained from the similarity in politeness of the text, the similarity may be obtained from the character size, line length, character pitch, line pitch, inter-character gap, or inter-line gap measured from the character images in the character area.

【０２０６】例えば、基本文字領域と比較文字領域内の
文字画像から、文字大きさの平均をそれぞれ出し、Ｘ、
Ｙとすると類似度Ｓは、
Ｓ＝１．０−（ＸとＹの差）÷β…（２）
次に類似度Ｓを連続性Ｃに加える。式（２）で、βは定数である。For example, when the average character sizes obtained from the character images in the basic character area and the comparison character area are X and Y, respectively, the similarity S is
S = 1.0 - |X - Y| ÷ β ... (2)
Next, the similarity S is added to the continuity C. In equation (2), β is a constant.

【０２０７】この求めた連続性Ｃが、閾値αよりも大き
ければ、基本文字領域と比較文字領域は連続すると判定
される。If the continuity C thus obtained is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous.

【０２０８】具体的に考えると基本文字領域内の文字大
きさの平均が６４．５ドット、比較文字領域内の文字大
きさの平均が５９．３ドットとすると類似度Ｓは、Ｓ＝１．０−（６４．５−５９．３）÷１００＝０．９
５である。As a concrete example, if the average character size is 64.5 dots in the basic character area and 59.3 dots in the comparison character area, the similarity S is S = 1.0 - (64.5 - 59.3) ÷ 100 = 0.95 (0.948 rounded to two decimal places).

【０２０９】類似度Ｓを連続性Ｃに加えて、連続性Ｃは
Ｃ＝０．９５となる。The similarity S is added to the continuity C, and the continuity C becomes C = 0.95.

【０２１０】これは、閾値αよりも大きいので基本文字
領域と比較文字領域は連続すると判定する。ただし、こ
こで閾値αは０．８、定数βは１００とする。Since this is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous. However, the threshold value α is 0.8 and the constant β is 100 here.
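Equation (2) differs from equation (1) only in the normalizing constant β, which brings a measured quantity such as a size in dots onto a scale comparable with a proportion. A minimal sketch follows; the function name is hypothetical, and the difference is again taken as an absolute value.

```python
def scaled_similarity(x: float, y: float, beta: float) -> float:
    """Equation (2): S = 1.0 - |X - Y| / beta, where beta is a positive constant."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return 1.0 - abs(x - y) / beta


# Average character sizes of 64.5 and 59.3 dots with beta = 100.
s = scaled_similarity(64.5, 59.3, beta=100.0)
c = 0.0 + s                      # add the similarity to the continuity C
print(round(s, 3), c >= 0.8)     # compare with the threshold of 0.8
```

The choice of β sets how large a difference is tolerated: with β = 100 and a threshold of 0.8, average sizes differing by up to 20 dots are still judged continuous.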

【０２１１】尚、上述のような文字大きさを用いて文字
領域間の類似度を出す例に替えて、行長さ、文字ピッ
チ、行ピッチ、文字間の隙間、行間の隙間を用いてもも
ちろん良い。また、それらのいくつかを組み合わせても
良い。Instead of the example above, in which the similarity between character areas is obtained from the character size, the line length, character pitch, line pitch, inter-character gap, or inter-line gap may of course be used. Several of these may also be combined.

【０２１２】尚、上述のような文字領域間の類似度は、
文章表現のうちの丁寧さに関する類似度を用いて類似度
を出す例に替えて、類似度は文字領域内の文字画像から
求めたフォントの違いを用いて求めても良い。Instead of the example in which the similarity between character areas is obtained from the similarity in politeness of the text, the similarity may be obtained from the difference in fonts determined from the character images in the character area.

【０２１３】例えば、基本文字領域と比較文字領域内の
文字画像から、総文字数に対する明朝体の文字数の割合
をそれぞれ出し、式（１）によって文字領域間の類似度
を求めても良い。For example, from the character images in the basic character area and the comparison character area, the ratio of the number of characters in Mincho typeface to the total number of characters may be obtained, and the similarity between the character areas may be obtained by the equation (1).

【０２１４】次に類似度Ｓを連続性Ｃに加える。Next, the similarity S is added to the continuity C.

【０２１５】この求めた連続性Ｃが、閾値αよりも大き
ければ、基本文字領域と比較文字領域は連続すると判定
される。If the continuity C thus obtained is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous.

【０２１６】具体的に考えると基本文字領域内の明朝体
の割合が０．９、比較文字領域内の明朝体の割合が０．
９１とすると類似度Ｓは、Ｓ＝１．０−（０．９１−０．９）＝０．９９である。As a concrete example, if the proportion of Mincho characters is 0.9 in the basic character area and 0.91 in the comparison character area, the similarity S is S = 1.0 - (0.91 - 0.9) = 0.99.

【０２１７】類似度Ｓを連続性Ｃに加えて、連続性Ｃは
Ｃ＝０．９９となる。Adding the similarity S to the continuity C, the continuity C becomes C = 0.99.

【０２１８】これは、閾値αよりも大きいので基本文字
領域と比較文字領域は連続すると判定する。ただし、こ
こで閾値αは０．９とする。Since this is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous. However, the threshold value α is 0.9 here.

【０２１９】尚、上述のような総文字数に対する明朝体
の文字数の割合を用いて文字領域間の類似度を出す例に
替えて、例えば基本文字領域に使われているフォント
は、ゴシックＢＢＢ体であり、比較文字領域に使われて
いるフォントは、標準幅ゴシック体であるから、類似度
０、つまり連続性ＣはＣ＝０．０として、基本文字領域
と比較文字領域は連続しないと判定しても良い。Instead of the example above, in which the similarity between character areas is obtained from the proportion of Mincho characters in the total number of characters, the following judgment may be made: for example, if the font used in the basic character area is Gothic BBB and the font used in the comparison character area is standard-width Gothic, the similarity is set to 0, that is, the continuity C is set to C = 0.0, and the basic character area and the comparison character area are judged not to be continuous.

【０２２０】尚、上述の例のように総文字数に対する明
朝体の文字数の割合を用いて文字領域間の類似度を出す
例に替えて、明朝体の代わりにゴシック体、教科書体は
もちろん、斜体や細明朝体、太明朝体等のフォントとし
ても良い。Instead of the example above, in which the similarity between character areas is obtained from the proportion of Mincho characters in the total number of characters, Mincho may be replaced not only with Gothic or Kyōkasho (textbook) typefaces but also with fonts such as italic, light Mincho, and bold Mincho.

【０２２１】尚、上述のような文字領域間の類似度は、
文章表現のうちの丁寧さに関する類似度を用いて類似度
を出す例に替えて、類似度は文字領域内の文字画像から
求めた文字または行の傾斜の違いを用いて求めても良
い。Instead of the example in which the similarity between character areas is obtained from the similarity in politeness of the text, the similarity may be obtained from the difference in the inclination of characters or lines determined from the character images in the character area.

【０２２２】例えば、基本文字領域と比較文字領域内の
文字画像から、傾斜角度をそれぞれ出し、式（２）によ
って文字領域間の類似度を求めても良い。For example, the inclination angles may be respectively obtained from the character images in the basic character area and the comparison character area, and the similarity between the character areas may be obtained by the equation (2).

【０２２３】次に求めた類似度Ｓを連続性Ｃに加える。Next, the calculated similarity S is added to the continuity C.

【０２２４】この求めた連続性Ｃが、閾値αよりも大き
ければ、基本文字領域と比較文字領域は連続すると判定
される。If the continuity C thus obtained is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous.

【０２２５】具体的に考えると基本文字領域の傾斜角度
が０．１度、比較文字領域の傾斜角度が１．０度とする
と類似度Ｓは、Ｓ＝１．０−（１．０−０．１）÷９０＝０．９９であ
る。As a concrete example, if the inclination angle is 0.1 degrees in the basic character area and 1.0 degrees in the comparison character area, the similarity S is S = 1.0 - (1.0 - 0.1) ÷ 90 = 0.99.

【０２２６】類似度Ｓを連続性Ｃに加えて、連続性Ｃは
Ｃ＝０．９９となる。これは、閾値αよりも大きいので
基本文字領域と比較文字領域は連続すると判定する。た
だし、ここで閾値αは０．９、定数βは９０とする。Adding the similarity S to the continuity C, the continuity C becomes C = 0.99. Since this is larger than the threshold value α, it is determined that the basic character area and the comparison character area are continuous. However, here, the threshold value α is 0.9 and the constant β is 90.

[0227] Note that, instead of calculating the similarity between character areas from the politeness of the text expression as described above, the similarity may be obtained from the difference in set direction (vertical or horizontal writing), determined from the character images in the character areas.

[0228] For example, the ratio of the number of vertically written characters to the total number of characters may be obtained from the character images in each of the basic character area and the comparison character area, and the similarity between the character areas may be obtained by equation (1).

[0229] Next, the obtained similarity S is added to the continuity C.

[0230] If the resulting continuity C is larger than the threshold α, the basic character area and the comparison character area are determined to be continuous.

[0231] As a concrete example, if the ratio of vertical writing in the basic character area is 0.9 and that in the comparison character area is 0.91, the similarity S is S = 1.0 − (0.91 − 0.9) = 0.99.

[0232] Adding the similarity S to the continuity C gives C = 0.99. Since this is larger than the threshold α, the basic character area and the comparison character area are determined to be continuous. Here, the threshold α is 0.9.
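
The ratio-based calculation in paragraphs [0228] to [0232] can be sketched the same way. Equation (1) is assumed here to have the form S = 1.0 − |r_compared − r_base|, which matches the worked numbers. The function names and the character counts (90 of 100 and 91 of 100) are made up for illustration, and only the ratios 0.9 and 0.91 come from the text.

```python
def vertical_ratio(vertical_chars: int, total_chars: int) -> float:
    """Ratio of vertically written characters to all characters in an area."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return vertical_chars / total_chars


def ratio_similarity(base_ratio: float, compared_ratio: float) -> float:
    """Similarity S from two ratios (assumed form of equation (1))."""
    return 1.0 - abs(compared_ratio - base_ratio)


ALPHA = 0.9  # threshold stated in the text

base = vertical_ratio(90, 100)       # 0.9, hypothetical counts
compared = vertical_ratio(91, 100)   # 0.91, hypothetical counts
similarity = ratio_similarity(base, compared)
continuity = 0.0 + similarity        # C is assumed to start at 0 before S is added
continuous = continuity > ALPHA
```

The same `ratio_similarity` helper would apply unchanged if the ratios were computed per line instead of per character.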

[0233] Note that, instead of using the ratio of vertically written characters to the total number of characters to obtain the similarity between character areas as described above, the set directions may be compared directly. For example, if the set direction of the basic character area is vertical writing and that of the comparison character area is horizontal writing, the similarity is 0, that is, the continuity C is C = 0.0, and the basic character area and the comparison character area may be determined not to be continuous.
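
The all-or-nothing variant in paragraph [0233] can be sketched as a rule that forces the continuity to 0.0 when the set directions differ. The text only specifies the mismatch case. Leaving C unchanged when the directions match is an assumption of this sketch, as are the names `SetDirection` and `apply_direction_rule`.

```python
from enum import Enum


class SetDirection(Enum):
    VERTICAL = "vertical"
    HORIZONTAL = "horizontal"


def apply_direction_rule(base: SetDirection, compared: SetDirection,
                         continuity: float) -> float:
    """Return C = 0.0 when the set directions differ."""
    if base is not compared:
        return 0.0
    return continuity  # assumption: matching directions leave C unchanged


ALPHA = 0.9  # threshold stated in the text

c_mismatch = apply_direction_rule(SetDirection.VERTICAL,
                                  SetDirection.HORIZONTAL, 0.99)
c_match = apply_direction_rule(SetDirection.VERTICAL,
                               SetDirection.VERTICAL, 0.99)
```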

[0234] Note that, instead of using the ratio of vertically written characters to the total number of characters to obtain the similarity between character areas as described above, the ratio of vertically written lines to the total number of lines may be used.

[0235] (Embodiment 2) In Embodiment 1 above, two character areas were examined to determine whether they are continuous. This embodiment describes how continuity is determined when three or more areas are examined.

[0236] FIG. 14 is a flowchart showing the ordering method of this embodiment.

[0237] The method of determining whether character areas are continuous in this embodiment is described with reference to FIG. 14. Step S102 is the same as in FIG. 1, and the method of determining the continuity C described in detail in the previous embodiment is also used in this embodiment.

[0238] First, in step S1401, the basic character area, which is the starting point for the continuity check, is extracted.

[0239] In the example of FIG. 5, character area 21 is the basic character area.

[0240] Next, in step S1402, one or more comparison character areas to be checked for continuity with the basic character area are extracted.

[0241] In the example of FIG. 5, character areas 22 and 23 are the comparison character areas.

[0242] In step S1403, one comparison character area is taken out.

[0243] In the example of FIG. 5, character area 22 is taken out first as the comparison character area.

[0244] In step S102, the continuity C from the basic character area to the comparison character area is obtained. The details of step S102 are as shown in FIG. 9 and are exactly the same as those described in the second embodiment.

[0245] In the example of FIG. 5, the continuity C for character area 22 is 1.0.

[0246] Next, in step S1404, it is determined whether the continuity C has been obtained for all the comparison character areas. If so, the process proceeds to step S1405. If any remain, the process returns to step S1403, where one of the areas extracted in step S1402 for which the continuity C has not yet been obtained is taken out, and the processing to obtain the continuity continues.

[0247] In the example of FIG. 5, character area 23 remains, so the process returns to step S1403 and continues. When the continuity is obtained in the same way as for character area 22, the continuity C is 0.0, as described in the second embodiment. Since the continuity has now been obtained for all the comparison character areas, the process proceeds to step S1405.

[0248] In step S1405, the basic character area is determined to continue to the comparison character area with the largest continuity C.

[0249] In the example of FIG. 5, the continuity C to comparison character area 22 is 1.0 and the continuity C to comparison character area 23 is 0.0, so the character area that follows basic character area 21 is determined to be character area 22.

[0250]

[Effects of the Invention] As described above, according to the present invention, when determining whether two or more character areas are continuous, a reading order is assigned to the character areas using the continuity of the character areas obtained by analyzing the characters, sentences, or images within them. As a result, a correct reading order can be assigned even when a document such as a newspaper contains multiple articles, or when the order cannot be determined correctly from position alone, which reduces the effort required for corrections and the like.

[Brief description of drawings]

FIG. 1 is a flowchart showing the ordering process of Embodiment 1.

FIG. 2 is a diagram showing an example of a document image.

FIG. 3 is a diagram illustrating the case where the last character of character area 21 of the document image shown in FIG. 2 is an arrow.

FIG. 4 is a detailed flowchart of step S102.

FIG. 5 is a diagram showing the first and last characters of character areas 21, 22, and 23 of the document image shown in FIG. 2.

FIG. 6 is a diagram showing the first and last characters of character areas 21, 22, and 23 of the document image shown in FIG. 2.

FIG. 7 is a diagram showing the first and last characters of character areas 21, 22, and 23 of the document image shown in FIG. 2.

FIG. 8 is a diagram showing the first and last characters of character areas 21, 22, and 23 of the document image shown in FIG. 2.

FIG. 9 is a detailed flowchart of step S102.

FIG. 10 is a detailed flowchart of step S102.

FIG. 11 is a detailed flowchart of step S102.

FIG. 12 is a detailed flowchart of step S102.

FIG. 13 is a diagram showing part of the subject-predicate relevance data.

FIG. 14 is a flowchart showing the ordering process of Embodiment 2.

FIG. 15 is a diagram showing all the characters of character areas 21, 22, and 23 of the document image shown in FIG. 2.

FIG. 16 is a detailed flowchart of step S102.

FIG. 17 is a detailed flowchart of step S102.

FIG. 18 is a block diagram showing the configuration of the apparatus of this embodiment.

Claims

[Claims]

1. An image processing method characterized by storing a document image, storing area information relating to at least two character areas existing in the document image, and determining whether or not the sentences included in two of the character areas are continuous.

2. The image processing method according to claim 1, wherein the determination as to whether the sentences are continuous is made by analyzing the sentences included in the two character areas.

3. The image processing method according to claim 1, wherein the area information relating to the stored character area is information obtained by separating the stored document image into areas.

4. The image processing method according to claim 2, wherein the analysis of the sentences included in the character areas is performed on characters obtained by character recognition of the image information included in the character areas.

5. The image processing method according to claim 1, wherein whether or not the sentences included in the two character areas are continuous is determined by an index of continuity.

6. The image processing method according to claim 5, wherein, when an arrow exists at the end of the character area, the continuity index is increased for the character area existing in the direction indicated by the arrow.

7. The continuity index is set to be much larger in a character area that is staggered first when a symbol indicating the end of a sentence is present at the end of the character area. The image processing method described in.

8. The image processing method according to claim 7, wherein the symbol indicating the end of the sentence is a punctuation mark.

9. The image processing method according to claim 7, wherein the symbol indicating the end of the sentence is a period.

10. The image processing method according to claim 5, wherein the continuity index is determined using the certainty that the last sentence of the one character area and the first sentence of the other character area form one sentence.

11. The image processing method according to claim 10, wherein the continuity index is determined when the end of the character area is not a symbol indicating the end of a sentence.

12. The certainty as one sentence is,
12. The likelihood as one sentence is increased when the first sentence of the other character region starts with a particle when the end of the one character region ends with a noun. Image processing method.

13. The image processing method according to claim 11, wherein the certainty as one sentence is increased when the last sentence of the one character area contains a subject but no predicate and the first sentence of the other character area contains a predicate but no subject.

14. The image processing method according to claim 13, wherein the certainty as one sentence is obtained using the subject included in the one character area and the predicate included in the other character area.

15. The image processing method according to claim 1, wherein the determination as to whether the sentences are continuous is made using the proportion of words or synonyms that exist in common in both character areas.

16. The image processing method according to claim 1, wherein the determination as to whether the sentences are continuous is made using the similarity between the character areas.

17. The image processing method according to claim 16, wherein the similarity between the character areas is determined with respect to a text expression between the character areas.

18. The image processing method according to claim 17, wherein the text expression is politeness.

19. The image processing method according to claim 17, wherein the text expression is a line ending expression.

20. The image processing method according to claim 17, wherein the text expression is a treatment expression.

21. The image processing method according to claim 16, wherein the similarity between the character areas is determined based on a composition ratio for each genre in the character area.

22. The image processing method according to claim 21, wherein the genre is Chinese characters.

23. The image processing method according to claim 21, wherein the genre is hiragana.

24. The image processing method according to claim 21, wherein the genre is katakana.

25. The image processing method according to claim 21, wherein the genre is a symbol.

26. The image processing method according to claim 21, wherein the genre is a number.

27. The image processing method according to claim 21, wherein the genre is an alphabetic character.

28. The similarity between the character areas is determined based on a format in the character area.
6. The image processing method according to item 6.

29. The image processing method according to claim 28, wherein the format is a font.

30. The image processing method according to claim 28, wherein the format is a character size.

31. The image processing method according to claim 28, wherein the format is a line length.

32. The image processing method according to claim 28, wherein the format is a character pitch.

33. The image processing method according to claim 28, wherein the format is a line pitch.

34. The image processing method according to claim 28, wherein the format is an inclination of a character.

35. The image processing method according to claim 28, wherein the format is line inclination.

36. The image processing method according to claim 28, wherein the format is a space between characters.

37. The image processing method according to claim 28, wherein the format is a space between lines.

38. The image processing method according to claim 28, wherein the format is a set direction.