JPH01196685A

JPH01196685A - Method for detecting character

Info

Publication number: JPH01196685A
Application number: JP63019595A
Authority: JP
Inventors: Masatoshi Okada; 岡田　正年; Akiko Konno; 紺野　章子
Original assignee: Fuji Electric Co Ltd
Current assignee: Fuji Electric Co Ltd
Priority date: 1988-02-01
Filing date: 1988-02-01
Publication date: 1989-08-08
Anticipated expiration: 2012-01-08
Also published as: JP2569103B2

Abstract

PURPOSE: To detect side dots, side lines, ruby (phonetic kana glosses), and underlines, and thereby improve reading performance, by using the pitch between each pair of adjacent character strings (or character lines) and the width of each character string (or character line). CONSTITUTION: Using the pitch between the extracted character strings (or character lines) and the widths of the character strings (or character lines), such as W_0 and W_1, ordinary character strings (or lines) C_0, C_2 to C_4, and C_6 are separated from strings (or lines) C_1 and C_5, which contain side dots, side lines, ruby, or underlines. The individual elements are then detected from a projection taken perpendicular to the side dot, side line, ruby, or underline column, and their positions are compared with the positions of the individual characters in the character string (or character line). Characters carrying side dots, side lines, ruby, underlines, and the like can thus be detected, and reading performance is improved.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a method that uses known image processing techniques to detect, in vertically or horizontally written documents, and in particular in documents containing side dots, side lines, ruby, or underlines, the characters to which these marks are attached.

[Conventional technology]

Conventionally, a known method of extracting character strings or character lines from a document performs the extraction without paying particular attention to the presence of side dots, side lines, ruby, or underlines. The width of each extracted character string or character line is compared with the standard character size of the target character area, and when the width is at or below a predetermined width, that character string or character line is removed as one containing noise. The relationship between side dots, side lines, ruby, and underlines in vertically and horizontally written text is illustrated in FIG. 7.

[Problem to be solved by the invention]

With this method, however, a column or row containing ruby or underlines becomes a candidate for removal because of its width, and once it has been removed, all information about it is lost. In the case of ruby in particular, a column or row whose width is at least the predetermined width is not removed, but the resulting character string or character line is then treated exactly like any other non-ruby character string or character line, that is, as one independent character string or character line. Ruby is by nature attached to another character string or character line and has no meaning on its own, so reading a document with the conventional method results in contextually meaningless lines being inserted here and there.

Thus, side dots, side lines, ruby, underlines, and the like either vanished without a trace or survived as a single (and meaningless) column or row completely independent of the others.

Accordingly, an object of the present invention is to prevent information such as side dots, side lines, ruby, and underlines from being lost, to make it possible to detect the characters to which they are attached, and thereby to improve character reading performance.

[Means to solve the problem]

An image processing device extracts the character strings (or character lines) within the target document area. A standard pitch is obtained from the pitches between each pair of adjacent character strings (or character lines), and a pitch threshold derived from the standard pitch is compared with the pitch between each pair of character strings (or character lines). For two character strings (or character lines) whose pitch is at or below the threshold, the width of each is compared with the standard character size; when the width of only one of them is at or below a predetermined value, the narrower character string (or character line) is detected as a column (or row) containing side dots, side lines, ruby, or an underline. The individual elements are then extracted from this column (or row), the individual characters are likewise extracted from the character string or character line, and the two results are compared to detect the characters to which side dots, side lines, ruby, or underlines are attached.

[Effect]

Using the pitch between the extracted character strings (or character lines) and the width of each character string (or character line), ordinary character strings (or character lines) are separated from columns (or rows) containing side dots, side lines, ruby, or underlines. The position of each element is then detected from a projection taken perpendicular to the side dot, side line, ruby, or underline column, and this position is compared with the position of each character in the character string (or character line). This makes it possible to detect characters carrying side dots, side lines, ruby, underlines, and the like, and improves reading performance.

[Example]

FIG. 1 is a schematic flowchart showing an embodiment of the present invention, and FIG. 2 is a flowchart showing its details. The description below follows FIG. 2. As a concrete example, consider the vertically written text of FIG. 3.

(1) (Corresponds to (1) in FIG. 2; the same applies to the steps below.) Based on the extraction coordinates (start coordinate A_i, end coordinate B_i) of the character string extraction result, the distance (pitch) P_i between the central axes of adjacent character strings is obtained as

    P_i = (A_{i+1} + B_{i+1})/2 - (A_i + B_i)/2

(see P_0 to P_5 in FIG. 3).

(2) From the obtained pitches P_i, their average value P_a (P_a = (Σ_i P_i)/N), or alternatively their median or mode, is obtained and used as the standard pitch.

(3) From the value of P_a, the pitch threshold P_th is obtained as P_th = α·P_a (α: a constant), and P_th is compared with each pitch P_i.

(4) The width of each character string is obtained.

(5) For two character strings whose pitch is at or below the threshold P_th, the width of each character string is compared with a predetermined width threshold derived from the standard character size.

(6) If the comparison shows that the width of only one of the two character strings falls short of the predetermined width, the two character strings are regarded as a pair consisting of an ordinary character string and a side dot, side line, ruby, or underline string (hereinafter also referred to as "ruby, underline, etc.").
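The pitch and width tests described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function name `find_annotation_pairs`, the input layout (a list of start/end coordinate pairs ordered along the pitch axis), and the value `alpha=0.7` are hypothetical. The source says only that α is a constant; the width factor 0.8 is the example value given in the text.

```python
from statistics import mean


def find_annotation_pairs(strings, std_char_size, alpha=0.7, width_ratio=0.8):
    """Return (ordinary_index, annotation_index) pairs.

    strings: list of (A_i, B_i) start/end coordinates of the extracted
    character strings, ordered along the axis on which the pitch is measured.
    alpha is a hypothetical value; the source only calls it a constant.
    """
    # Pitch P_i: distance between the central axes of adjacent strings.
    centers = [(a + b) / 2 for a, b in strings]
    pitches = [centers[i + 1] - centers[i] for i in range(len(centers) - 1)]
    if not pitches:
        return []

    p_a = mean(pitches)                  # standard pitch (median or mode also allowed)
    p_th = alpha * p_a                   # pitch threshold P_th
    w_th = width_ratio * std_char_size   # width threshold

    pairs = []
    for i, pitch in enumerate(pitches):
        if pitch > p_th:
            continue  # only strings with a pitch at or below P_th are examined
        left_narrow = (strings[i][1] - strings[i][0]) < w_th
        right_narrow = (strings[i + 1][1] - strings[i + 1][0]) < w_th
        if left_narrow != right_narrow:  # exactly one of the two is narrow
            pairs.append((i + 1, i) if left_narrow else (i, i + 1))
    return pairs


# Made-up coordinates loosely mirroring FIG. 3: C_1 and C_5 are narrow columns.
columns = [(0, 20), (22, 28), (40, 60), (70, 90), (100, 120), (122, 128), (140, 160)]
print(find_annotation_pairs(columns, std_char_size=20))
```

With these made-up coordinates the two small pitches are the ones next to the narrow columns, so the sketch pairs index 0 with 1 and index 4 with 5.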

Consider these operations for the case where character strings have been extracted as in FIG. 3. From the pitches P_0 to P_5, the average value P_a, for example, and its threshold P_th are obtained. P_th is then compared with each of P_0 to P_5, and in this case the small pitches P_0 and P_4 are detected. Pitch P_0 is the pitch between character strings C_0 and C_1, and P_4 is the pitch between character strings C_4 and C_5. Once a narrow pitch has been detected, the widths of the character strings on either side of it are examined. For pitch P_0, for example, the widths of character strings C_0 and C_1 are examined. Here the width W_1 = B_1 - A_1 of character string C_1 is smaller than the width W_0 = B_0 - A_0 of character string C_0. If C_0 is an ordinary character string whose width W_0 is about the standard character size, then with a suitable threshold (for example, 0.8 times the standard character size) C_0 is judged to be at or above the threshold and C_1 at or below it, and the two character strings are taken to be a pair of an ordinary character string and side dots, side lines, ruby, or an underline. Character strings C_4 and C_5 are detected by the same procedure.

For each pair of an ordinary character string and ruby, underline, etc. detected in this way, the position of the ruby, underline, etc. is detected by the following procedure.

(7) Of the two character strings, the wider one (character strings C_0 and C_4 in FIG. 3) undergoes character extraction according to a character extraction algorithm. Character extraction algorithms are well known and are therefore not described here.

(8) The narrower character string (character strings C_1 and C_5 in FIG. 3) is considered to be a side dot, side line, ruby, or underline string, so the side dots, side lines, ruby, or underlines are extracted from it. This extraction follows the character extraction algorithm, just as for an ordinary character string. In this case, however, the ruby characters are not extracted one by one. Ruby characters whose spacing (the distance between the end position of one character and the start position of the next: D_r in FIG. 4B) is at or below a predetermined size are regarded as a single group of ruby attached to one word, and the start position of the first ruby character (X_r in FIG. 4B) and the end position of the last ruby character (Y_r in FIG. 4B) are taken as the position of the ruby. Ruby and side dot columns are distinguished from side line and underline columns by the projection of the column (the number of lines or the length of the black portion in the projection). Ruby and side dots are distinguished from each other by the number of lines.

(9) The position of each character obtained by ordinary character extraction is compared with the position of the ruby, underline, etc. obtained by the ruby, underline, etc. extraction, and the characters to which the ruby, underline, etc. is attached are found.
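The grouping of ruby characters by their spacing D_r can be illustrated with a short Python sketch. It assumes, hypothetically, that the narrow column has already been reduced to a one-dimensional projection profile (a count of black pixels per position along the column); the function names and the `max_gap` parameter are illustrative and do not come from the source.

```python
def runs_from_projection(profile):
    """Return (start, end) positions of consecutive nonzero bins, end inclusive."""
    runs, start = [], None
    for pos, count in enumerate(profile):
        if count > 0 and start is None:
            start = pos                      # a new element begins
        elif count == 0 and start is not None:
            runs.append((start, pos - 1))    # the element ended at the previous bin
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


def group_ruby_spans(runs, max_gap):
    """Merge neighbouring runs whose spacing D_r is at or below max_gap.

    Each returned (X_r, Y_r) span is the start of the first element and the
    end of the last element in one group.
    """
    spans = []
    for start, end in runs:
        if spans and start - spans[-1][1] <= max_gap:
            spans[-1] = (spans[-1][0], end)  # extend the current group
        else:
            spans.append((start, end))       # open a new group
    return spans


# Made-up profile: two close elements, then a distant third one.
profile = [0, 2, 3, 0, 1, 2, 0, 0, 0, 0, 0, 4, 1, 0]
print(group_ruby_spans(runs_from_projection(profile), max_gap=3))
```

Here the first two elements are separated by a spacing of 2 and are merged into one span, while the third element, at a spacing of 6, starts a new span.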

The specific procedure is explained below, taking as an example the case where an extraction result like that of FIG. 4A has been obtained. Only an example with ruby attached is shown here.

(9-1) The start coordinates S_i of the characters (S_0 to S_5 in FIG. 4C) are compared with the ruby start coordinate X_r in ascending order of value, and the first S_i for which X_r < S_i is found. In FIG. 4C this is S_3.

(9-2) The distances D_i and D_{i-1} between X_r and S_i and S_{i-1}, respectively, are calculated (D_3 and D_2 in FIG. 4C).

(9-3) D_i and D_{i-1} are compared. If D_{i-1} < D_i, the character whose start coordinate is S_{i-1} is taken as the first character of the word to which the ruby is attached; if D_{i-1} > D_i, the character whose start coordinate is S_i is taken instead.

In the example of FIG. 4C, D_2 < D_3, so the character 「漢」, whose start coordinate is S_2, is the first character.

(9-4) The same comparison is made between the character end coordinates T_i and the end coordinate Y_r of the ruby, underline, etc. The first T_i for which Y_r < T_i is found, and the corresponding distances E_i and E_{i-1} are compared in the same way as for the start coordinates to find the last character of the word to which the ruby is attached. In the example of FIG. 4D, [...]. This detection is performed for all ruby, underlines, etc.
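The start and end coordinate comparisons above amount to a nearest-boundary search, which can be sketched in Python as follows. The function names are hypothetical, and three behaviours are assumptions not stated in the source: a tie between the two distances selects the later character, a mark lying before the first coordinate selects the first character, and a mark lying beyond the last coordinate selects the last character.

```python
def nearest_boundary_index(char_coords, mark_coord):
    """Index of the character boundary nearest to mark_coord.

    char_coords must be sorted in ascending order (S_i or T_i values).
    """
    if not char_coords:
        raise ValueError("char_coords must not be empty")
    for i, coord in enumerate(char_coords):
        if mark_coord < coord:               # first coordinate beyond the mark
            break
    else:
        return len(char_coords) - 1          # assumption: mark beyond the last coordinate
    if i == 0:
        return 0                             # assumption: mark before the first coordinate
    d_i = coord - mark_coord                 # distance to boundary i
    d_prev = mark_coord - char_coords[i - 1] # distance to boundary i-1
    return i - 1 if d_prev < d_i else i      # assumption: a tie picks i


def annotated_char_range(starts, ends, x_r, y_r):
    """Return (first, last) indices of the characters covered by one mark."""
    return nearest_boundary_index(starts, x_r), nearest_boundary_index(ends, y_r)


# Made-up coordinates: six characters of height 18 at a spacing of 20.
starts = [0, 20, 40, 60, 80, 100]
ends = [18, 38, 58, 78, 98, 118]
print(annotated_char_range(starts, ends, x_r=43, y_r=75))
```

In this made-up case the first start coordinate beyond X_r = 43 is 60, but 40 is closer, so the word begins at index 2; the first end coordinate beyond Y_r = 75 is 78, which is also the closer one, so the word ends at index 3.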

The operations above yield position information for the ruby, underlines, etc. FIG. 5 shows an example in which the method of the present invention was applied to character recognition and the result was output after an additional process that inserts ruby marks before and after each word to which ruby is attached in the recognition result. Side dot marks, side line marks, and underline marks are handled in the same way. The input document corresponding to FIG. 5 is shown in FIG. 6.
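The mark insertion step can be illustrated with a minimal Python sketch. The marker strings `<r>` and `</r>` and the function name are hypothetical; the source does not say which symbols are used as ruby marks in FIG. 5.

```python
def insert_marks(chars, ranges, open_mark="<r>", close_mark="</r>"):
    """Wrap each (first, last) character range of the recognition result in marks.

    open_mark and close_mark are hypothetical placeholders for the ruby marks.
    """
    opens = {first for first, _ in ranges}
    closes = {last for _, last in ranges}
    out = []
    for i, ch in enumerate(chars):
        if i in opens:
            out.append(open_mark)    # mark inserted before the first character
        out.append(ch)
        if i in closes:
            out.append(close_mark)   # mark inserted after the last character
    return "".join(out)


print(insert_marks("abcdef", [(2, 3)]))  # ab<r>cd</r>ef
```

A different pair of marker strings could be passed in the same way for side dots, side lines, or underlines.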

[Effect of the invention]

According to the present invention, detecting side dots, side lines, ruby, and underlines prevents this information from being lost. In addition, the position of each is detected, and the detection result is compared with the character extraction result for the character string (or character line) to which it properly belongs, so that it can be determined which characters carry side dots, side lines, ruby, or underlines. This brings the advantage of a marked improvement in reading performance.

[Brief explanation of the drawing]

FIG. 1 is a schematic flowchart showing an embodiment of the present invention. FIG. 2 is a flowchart showing its details. FIG. 3 is an explanatory diagram illustrating a concrete example of how the pitch of each character string is obtained. FIGS. 4A to 4D are explanatory diagrams illustrating a concrete example of the method of detecting characters to which side dots, side lines, ruby, or underlines are attached. FIG. 5 is an explanatory diagram illustrating an example in which characters with ruby were detected and ruby marks were added. FIG. 6 is an explanatory diagram showing the text example corresponding to FIG. 5. FIG. 7 is an explanatory diagram illustrating the relationship between vertically written text, horizontally written text, and side dots, side lines, ruby, or underlines.

Explanation of symbols

A_i: extraction start coordinate of a character string
B_i: extraction end coordinate of a character string
P_i: distance (pitch) between character strings
P_a: standard pitch
P_th: pitch threshold
C_i: character string
W_0, W_1: width of a character string
X_r: start position of the first ruby character
Y_r: end position of the last ruby character
S_i: start coordinate of each character
T_i: end coordinate of each character

Agent: Patent Attorney Akio Namiki (並木 昭夫)
Agent: Patent Attorney Matsuzaki (松崎 清)

Claims

1) A character detection method characterized in that: character strings (or character lines) within a target document area are extracted by an image processing device; a standard pitch is obtained from the pitches between each pair of adjacent character strings (or character lines); a pitch threshold derived from the standard pitch is compared with the pitch between each pair of character strings (or character lines); the widths of two character strings (or character lines) whose pitch is at or below the threshold are each compared with the standard character size, and when the width of only one character string (or character line) is at or below a predetermined value, the narrower character string (or character line) is detected as a column (or row) containing side dots, side lines, ruby, or an underline; the side dots, side lines, ruby, underline, or the like are then extracted from that column (or row), while individual characters are extracted from the character string (or character line); and the two results are compared to detect the characters to which side dots, side lines, ruby, underlines, or the like are attached.