JPH03268065A

JPH03268065A - Article extracting system

Info

Publication number: JPH03268065A
Application number: JP2067177A
Authority: JP
Inventors: Teruo Akiyama; 秋山　照雄
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1990-03-19
Filing date: 1990-03-19
Publication date: 1991-11-28
Anticipated expiration: 2013-05-13
Also published as: JP2749425B2

Abstract

PURPOSE: To accurately segment a document composed of multiple articles into individual articles by obtaining the keywords contained in the document for each character-string block and deciding, on the basis of the similarity of these keywords, which article each block belongs to. CONSTITUTION: An image input part 18, a storage device 19, an input/output terminal 20, a character-string block extracting mechanism 21, a character segmenting mechanism 22, a character recognizing mechanism 23, a keyword extracting mechanism 24, a block deciding mechanism 25, and an article extracting mechanism 26 are connected to a central processing unit 17. Blocks, the units that constitute an article, are extracted from an input document image, and keywords are extracted from the character string of each block. The similarity of the keywords is then compared, and the document is segmented into articles according to whether the blocks are judged to constitute the same article. In this way, a document composed of multiple articles can be accurately segmented into individual articles.

Description

DETAILED DESCRIPTION OF THE INVENTION [Object of the Invention] (Industrial Application Field) The present invention relates to an article extraction system that divides a document composed of multiple articles into regions, one per article, and extracts each article.

(Prior Art) To separate a document such as a newspaper, which is composed of multiple articles, into individual articles, it is necessary to examine how the blocks that make up each article are connected.

For example, as shown in FIG. 2, if character string block 7 is connected to character string block 3, then headline block 1 and body blocks 2, 3, and 7 form article A, and headline 4 and body blocks 5 and 6 form article B. Conversely, if character string block 7 is connected to character string block 6, block 7 belongs to article B.

Here, 8 and 9 are ruled lines used to make the block boundaries clear.

One method of examining such connections between blocks is described in Japanese Examined Patent Publication No. Sho 61-32712. The last line of a block that may have a continuation (block 3 or 6 in FIG. 2) is examined to determine whether it ends with a small number of black pixels corresponding to a period. If the line does not end with a period, the method assumes that the article always continues in the next tier of the page, and in that tier the rightmost line with no blank space at its head is extracted as the start of the continuation of the article above.
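
The end-of-block test used by this prior-art method can be sketched as follows. This is a minimal illustration only: the `CharBox` type, the function names, and both thresholds are assumptions made for the example and are not taken from the cited publication.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """Measurements of one segmented character (hypothetical structure)."""
    black_pixels: int  # number of black pixels inside the bounding box
    width: int         # bounding-box width in pixels
    height: int        # bounding-box height in pixels


def looks_like_period(last: CharBox, typical: CharBox,
                      size_ratio: float = 0.5, ink_ratio: float = 0.2) -> bool:
    """Guess that the last character is a period if it is small and light.

    Both ratios are assumed values chosen for illustration.
    """
    small = (last.width <= typical.width * size_ratio
             and last.height <= typical.height * size_ratio)
    light = last.black_pixels <= typical.black_pixels * ink_ratio
    return small and light


def article_continues(last: CharBox, typical: CharBox) -> bool:
    """Prior-art assumption: no period at the end means the article goes on."""
    return not looks_like_period(last, typical)


typical = CharBox(black_pixels=400, width=30, height=30)
print(article_continues(CharBox(380, 30, 30), typical))  # full-size character
print(article_continues(CharBox(40, 8, 8), typical))     # period-sized mark
```

Because the test looks only at size and ink, any small, light mark passes it, and any large final character fails it regardless of what that character is.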

(Problem to be Solved by the Invention) However, if the judgment relies only on the number of black pixels or the size of the last character of a block, a block that ends with a comma (、) may be mistaken for the end of the text. Conversely, the writer's name is sometimes given in parentheses at the end of the text, and in that case it may be wrongly judged that the text continues. Furthermore, in newspapers, where ruled lines play an extremely important role in defining how text is connected, various kinds of decorative rules are used, and it is often difficult to extract the ruled lines accurately. If ruled lines are extracted incorrectly, it becomes extremely difficult to extract the connections between blocks. For example, when a ruled line is made up of a collection of small figures, as in a broken line, one ruled line may be extracted as two, may be extracted shorter than its actual length, or may not be extracted at all. As shown in FIG. 3, if ruled line 10 in FIG. 3(a), which should be extracted as a single ruled line, is split and extracted as two ruled lines 11 and 12 as in FIG. 3(b), it cannot be determined from the sizes and positions of the figures alone whether character string block 13 connects to character string block 14 or to character string block 15.

The present invention has been made in view of the above. Its object is to provide an article extraction system that correctly determines the connections between the blocks in a document, without relying solely on the position and size of the last character in a block and even when ruled lines and the like have not been extracted accurately, and that can thereby accurately separate a document composed of multiple articles into individual articles.

[Configuration of the Invention] (Means for Solving the Problem) To achieve the above object, the article extraction system of the present invention is provided in a document reading apparatus that divides document image data, obtained by inputting a document containing multiple articles, into a region for each article and converts the character strings constituting each article into character code strings by character recognition. The system comprises: a central processing unit that controls the series of processes; an image input unit that converts a document into an image; a storage device that stores the input document image and other data; an input/output terminal for displaying the results of the series of processes and for entering commands; a character string block extraction mechanism that extracts, from the input document image, blocks, which are the units that constitute an article; a character segmentation mechanism that cuts out the individual characters making up the character strings contained in the blocks extracted by the character string block extraction mechanism; a character recognition mechanism that recognizes the individual characters cut out by the character segmentation mechanism; a keyword extraction mechanism that extracts keywords from the character strings converted into codes by the character recognition mechanism; a block determination mechanism that determines whether blocks constitute the same article by comparing the similarity of the keywords extracted by the keyword extraction mechanism for each of the blocks extracted by the character string block extraction mechanism; and an individual article extraction mechanism that, based on the result obtained from the block determination mechanism, separates the multiple articles arranged on the same page into individual articles.

(Operation) In the article extraction system of the present invention, blocks, which are the units that constitute an article, are extracted from the input document image, and keywords are extracted from the character string of each extracted block. The similarity of the keywords extracted from the blocks is then compared to determine whether the blocks constitute the same article, and the document is divided into individual articles based on the result of that determination.
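
The flow described in this paragraph can be sketched as below, starting from text that has already been recognized for each block. This is an illustration under stated assumptions, not the patent's implementation: the function name `segment_into_articles`, the all-pairs comparison, and the union-find grouping are choices made for the example, and the keyword extractor and the similarity test are passed in as placeholder callables.

```python
from typing import Callable, Dict, List, Set


def segment_into_articles(
    block_texts: Dict[int, str],
    extract_keywords: Callable[[str], Set[str]],
    same_article: Callable[[Set[str], Set[str]], bool],
) -> List[List[int]]:
    """Group block ids into articles by comparing per-block keyword sets."""
    # Step 1: extract keywords from the character string of every block.
    keywords = {bid: extract_keywords(text) for bid, text in block_texts.items()}

    # Step 2: merge blocks judged to belong to the same article (union-find).
    parent = {bid: bid for bid in block_texts}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    ids = sorted(block_texts)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if same_article(keywords[a], keywords[b]):
                parent[find(a)] = find(b)

    # Step 3: collect the merged groups; each group is one article.
    groups: Dict[int, List[int]] = {}
    for bid in ids:
        groups.setdefault(find(bid), []).append(bid)
    return sorted(groups.values())


# Toy data loosely following the FIG. 2 example (English stand-in text).
blocks = {
    2: "school subject teacher",
    3: "school subject exam",
    5: "semiconductor DRAM memory",
    6: "DRAM memory chip",
    7: "semiconductor computer",
}
articles = segment_into_articles(
    blocks,
    extract_keywords=lambda text: set(text.lower().split()),
    same_article=lambda a, b: len(a & b) >= 1,
)
print(articles)  # [[2, 3], [5, 6, 7]]
```

Grouping transitively means blocks 6 and 7 end up in the same article through block 5 even though they share no keyword directly; whether that behavior is desirable depends on the page layout and is a design choice of this sketch.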

(Embodiment) An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of an article extraction system according to an embodiment of the present invention. The system shown in the figure has a central processing unit (CPU) 17 that controls the series of processes. Connected to the central processing unit 17 are an image input unit 18, a storage device 19, an input/output terminal 20, a character string block extraction mechanism 21, a character segmentation mechanism 22, a character recognition mechanism 23, a keyword extraction mechanism 24, a block determination mechanism 25, and an article extraction mechanism 26.

The image input unit 18 quantizes and samples the document using an image scanner or the like and stores the result in the storage device 19.

The input/output terminal 20 is used to display intermediate results of processing and to enter any necessary commands. The character string block extraction mechanism 21 extracts character string blocks, each of which is a set of character strings within the document.

The character string block extraction mechanism 21 can be realized, for example, by the technique described in "Region segmentation of document images using marginal distribution, line density, and circumscribed rectangle features", Transactions of the Institute of Electronics and Communication Engineers of Japan, J69-D, No. 8, pp. 1187-1196 (August 1986). FIG. 2, mentioned above, shows the state in which headline character strings 1 and 4 and body blocks 2, 3, 5, 6, and 7 have been extracted from the document.
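
Of the three features named in the cited title, the marginal distribution (projection profile) is the simplest to illustrate. The sketch below is not the cited technique, which also uses line density and circumscribed rectangles; it only shows how runs of ink in a row or column profile can suggest block boundaries. The function names and the `min_gap` parameter are assumptions for the example.

```python
from typing import List, Tuple


def marginal_distribution(image: List[List[int]], axis: int) -> List[int]:
    """Count black pixels (value 1) per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def split_on_gaps(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) ranges, end exclusive, of runs that contain ink.

    Runs are separated wherever at least `min_gap` consecutive entries are empty.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    gap = 0
    for i, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs


# A tiny binary image with two ink regions separated by a two-column gap.
image = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
row_runs = split_on_gaps(marginal_distribution(image, axis=0))
col_runs = split_on_gaps(marginal_distribution(image, axis=1), min_gap=2)
print(row_runs)  # [(1, 3)]
print(col_runs)  # [(0, 2), (4, 6)]
```

On a real page the gap threshold would have to distinguish the spacing between lines within a block from the wider spacing between blocks, which is one reason the cited work combines several features.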

If these techniques cannot extract the blocks completely, the input image may be displayed on a terminal device such as a CRT, and the incorrectly extracted blocks, the blocks that were missed, or in some cases all of the blocks may be entered with a device such as a mouse.

The character segmentation mechanism 22 cuts out individual characters from the character strings that make up the character string blocks extracted by the character string block extraction mechanism 21. It can be realized, for example, by the technique described in "Character segmentation from printed matter by priority segmentation of non-touching characters", Transactions of the Institute of Electronics and Communication Engineers of Japan, J67-D, No. 10, pp. 1194-1201 (October 1984).

The character recognition mechanism 23 recognizes the individual characters cut out by the character segmentation mechanism 22. It can be realized, for example, by the method described in "Recognition of handwritten kanji by contour direction contribution features", Transactions of the Institute of Electronics and Communication Engineers of Japan, J66-D, No. 10, pp. 1185-1192 (October 1983). That this method is also effective for printed characters is reported in "Automatic reading of documents of unknown format", Technical Report PRU86-33, pp. 31-40, of the Pattern Recognition and Understanding study group of the same institute.

The keyword extraction mechanism 24 performs morphological analysis on the recognized character code string and extracts keywords based on the result. It can be realized, for example, by the method described in "Keyword extraction by sentence analysis of patent claim text", Transactions of the Institute of Electronics and Communication Engineers of Japan, J65-D, No. 10, pp. 1195-1202 (October 1982), or by the method described in "Automatic keyword extraction system (INDEXER)", NTT Research and Development Report, Vol. 36, No. 9, pp. 1151-1158 (1987).
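
The idea of extracting keywords from the result of morphological analysis can be illustrated with a toy stand-in. The sketch below is not either of the cited methods: it assumes a tiny hand-made lexicon, segments the text by dictionary longest match, and keeps the nouns as keywords. The lexicon contents, the function names, and the rule "nouns are keywords" are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface form -> part of speech.
LEXICON: Dict[str, str] = {
    "半導体": "noun",   # semiconductor
    "メモリ": "noun",   # memory
    "計算機": "noun",   # computer
    "開発": "noun",     # development
    "した": "verb",
    "の": "particle",
    "を": "particle",
    "は": "particle",
}


def longest_match_tokenize(text: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Segment unspaced text by always taking the longest lexicon entry."""
    max_len = max(len(word) for word in lexicon)
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in lexicon:
                tokens.append((word, lexicon[word]))
                i += size
                break
        else:
            # No lexicon entry starts here: emit one unknown character.
            tokens.append((text[i], "unknown"))
            i += 1
    return tokens


def extract_keywords(text: str, lexicon: Dict[str, str] = LEXICON) -> List[str]:
    """Keep each noun once, in order of first appearance."""
    keywords: List[str] = []
    for word, pos in longest_match_tokenize(text, lexicon):
        if pos == "noun" and word not in keywords:
            keywords.append(word)
    return keywords


print(extract_keywords("半導体メモリを開発した計算機の半導体"))
# ['半導体', 'メモリ', '開発', '計算機']
```

A production system would use a full morphological analyzer and a proper keyword selection rule; the point here is only that keywords come from the analyzed word sequence rather than from the raw character string.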

The block determination mechanism 25 uses the keywords extracted from different blocks to determine whether those blocks belong to the same article. Besides the number of keywords the blocks have in common, the mechanism may use as its criterion the distance between the keywords they contain, obtained from relations such as broader and narrower terms in a thesaurus. FIG. 2 shows a state in which character string blocks 2 and 3 contain many education-related keywords such as "school" and "subject", while character string blocks 5 and 6 contain electronics keywords such as "semiconductor", "DRAM", and "memory". In this state, character string block 7, which contains keywords such as "semiconductor" and "computer", contains keywords that are the same as, or from the same field as, those in blocks 5 and 6, so the block determination mechanism 25 judges that character string blocks 5, 6, and 7 belong to the same article, article B. The article extraction mechanism 26 determines the region of each article based on the result of the block determination mechanism 25. In FIG. 2, character string blocks 1, 2, and 3 are output as one article and blocks 4, 5, 6, and 7 as another. The result of dividing the page into individual articles may be stored in the storage device 19 or displayed on the input/output terminal 20.
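
The two criteria named for the block determination mechanism, the count of shared keywords and a thesaurus-based distance, can be sketched as follows. The thesaurus contents, the way distance is measured (steps through broader terms to a common ancestor), and the thresholds `min_shared` and `max_distance` are all assumptions for illustration; the patent does not specify them.

```python
from typing import Dict, List, Optional, Set

# Hypothetical thesaurus: each term maps to its broader term (None at the top).
BROADER: Dict[str, Optional[str]] = {
    "electronics": None,
    "semiconductor": "electronics",
    "memory": "semiconductor",
    "DRAM": "memory",
    "computer": "electronics",
    "education": None,
    "school": "education",
    "subject": "education",
}


def ancestors(term: str, broader: Dict[str, Optional[str]]) -> List[str]:
    """Return [term, broader term, ...] up to the top of the hierarchy."""
    chain: List[str] = []
    current: Optional[str] = term
    while current is not None and current not in chain:
        chain.append(current)
        current = broader.get(current)
    return chain


def term_distance(a: str, b: str, broader: Dict[str, Optional[str]]) -> Optional[int]:
    """Steps between two terms via their nearest common broader term, or None."""
    chain_a = ancestors(a, broader)
    chain_b = ancestors(b, broader)
    for steps_a, term in enumerate(chain_a):
        if term in chain_b:
            return steps_a + chain_b.index(term)
    return None


def same_article(kw_a: Set[str], kw_b: Set[str],
                 broader: Dict[str, Optional[str]] = BROADER,
                 min_shared: int = 1, max_distance: int = 2) -> bool:
    """Judge two blocks as the same article by shared or closely related keywords."""
    if len(kw_a & kw_b) >= min_shared:
        return True
    for a in kw_a:
        for b in kw_b:
            d = term_distance(a, b, broader)
            if d is not None and d <= max_distance:
                return True
    return False


block_5_6 = {"semiconductor", "DRAM", "memory"}
block_7 = {"semiconductor", "computer"}
block_2_3 = {"school", "subject"}
print(same_article(block_7, block_5_6))  # True
print(same_article(block_7, block_2_3))  # False
```

With these toy values, block 7 joins blocks 5 and 6 through the shared keyword, and the thesaurus distance would still link, say, a block containing only "memory" to one containing only "semiconductor".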

[Effects of the Invention] As described above, according to the present invention, the keywords contained in a document composed of multiple articles are obtained for each block of character strings, and the similarity of these keywords is used to determine which keyword group, and hence which article, each block belongs to. The multiple articles can therefore be accurately separated into individual articles even if the final period of an article is not extracted and even if ruled lines and the like are not extracted accurately.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an article extraction system according to an embodiment of the present invention, FIG. 2 is a diagram showing an example of a document composed of two articles, and FIG. 3 is a diagram showing an example in which blocks of character strings are separated by ruled lines. 17: central processing unit, 21: character string block extraction mechanism, 22: character segmentation mechanism, 24: keyword extraction mechanism, 25: block determination mechanism, 26: article extraction mechanism.

Claims

[Claims]

An article extraction system, provided in a document reading apparatus that divides document image data, obtained by inputting a document containing multiple articles, into a region for each article and converts the character strings constituting each article into character code strings by character recognition, the system characterized by comprising: a central processing unit that controls the series of processes; an image input unit that converts a document into an image; a storage device that stores the input document image and other data; an input/output terminal for displaying the results of the series of processes and for entering commands; a character string block extraction mechanism that extracts, from the input document image, blocks, which are the units that constitute an article; a character segmentation mechanism that cuts out the individual characters making up the character strings contained in the blocks extracted by the character string block extraction mechanism; a character recognition mechanism that recognizes the individual characters cut out by the character segmentation mechanism; a keyword extraction mechanism that extracts keywords from the character strings converted into codes by the character recognition mechanism; a block determination mechanism that determines whether blocks constitute the same article by comparing the similarity of the keywords extracted by the keyword extraction mechanism for each of the blocks extracted by the character string block extraction mechanism; and an individual article extraction mechanism that, based on the result obtained from the block determination mechanism, separates the multiple articles arranged on the same page into individual articles.