JPH0922442A - Electronic management system for image document data

Info

Publication number: JPH0922442A
Application number: JP7191209A
Authority: JP
Inventors: Toshito Hara; 敏人原; Sadanari Iwata; 完成岩田
Original assignee: Advantest Corp
Current assignee: Advantest Corp
Priority date: 1995-07-04
Filing date: 1995-07-04
Publication date: 1997-01-21

Abstract

PROBLEM TO BE SOLVED: To improve the utility value of an electronic database by separating and extracting character information and line segment information from the dot information of image-format data, converting the extracted information into codes, and managing it as a text file of high utility value. SOLUTION: The system consists of an electronic file medium 30, a paper medium 20, a scanner 8, a line segment/character separation recognition unit 13, an editing machine 2, a recording medium 9, and a search machine 11. The unit 13 receives image data from the medium 30 or the scanner 8. To identify directional line-element features by a multidimensional vector, it normalizes the image data to a size of N dots × M dots, thins the lines, and assigns the result to four types of line elements: vertical, horizontal, upper-right diagonal, and upper-left diagonal. It then divides the data into A × B areas, compares them with a character/symbol dictionary J5 created and stored in advance, and sequentially separates line segments from character symbols and photographic images by feature extraction.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to an electronic management system for image document materials that facilitates information retrieval and management by identifying and extracting character codes from information in image data format, such as graphic information, photograph/image information, and character information.

[0002]

[Prior Art] Document materials include character information, drawings, photographic images, and audio data. To date, no device appears to exist that identifies and extracts character codes from image-format information in which graphic information, photograph/image information, character information, and the like are mixed. Here, image-format information means, for example, the dot image information of a picture read in by an image scanner.

[0003] Document materials come on paper media and on electronic file media. When a paper medium is taken in as document material, it is generally read with an image scanner and converted into an electronic file. There are two data formats: the character code format and the image data format. Character code format data is a group of text data expressed, for example, in JIS kanji codes, and can be searched directly by any character code. Image data format data is a group of data consisting of the dot information of figures, images, photographs, and the like, and cannot be searched by character code.

[0004] FIG. 4 shows the configuration of a conventional electronic management system for document materials. The system consists of an electronic file medium 30, a paper medium 20, a scanner 8, an editing machine 2, a recording medium 9, and a search machine 11.

[0005] The electronic file medium 30 is a recording medium such as a floppy disk or MT, communication data from a communication line, or the like, and is a medium in a form that can be input directly to the editing machine 2. Data on this medium is stored in either of the two data formats: the character code format or the image data format.

[0006] The paper medium 20 is a medium in a form that cannot be input directly to the editing machine 2, such as a paper drawing, a printed book, a photograph, or an image. The scanner 8 converts the paper medium 20 into dot information in the image data format that the editing machine 2 can accept, and supplies it.

[0007] The editing machine 2 receives input data from the electronic file medium 30 or the scanner 8 and saves it on the recording medium 9 in units of data. It attaches the data format information and a corresponding index, and saves them in association with the data as headings for indexing and retrieval. In this way, all kinds of information can be turned into an electronic database.

[0008] The search machine 11 receives input parameters 12 specifying search conditions from outside, and displays the data groups on the recording medium 9 that satisfy the search conditions, each in the data display format that corresponds to it. This makes it possible to search for and view the required data at any time.

[0009]

[Problems to Be Solved by the Invention] In the description above, the image-format information captured by the scanner 8 or the like contains character information. If this dot-represented character information could be converted into character codes, the utility value of the resulting electronic database would improve markedly. Reading devices that convert regularly formatted text, such as the pages of books and newspapers, into character codes already exist.

[0010] However, some image-format information, such as circuit drawings, contains a mixture of circuit symbols, symbol character information, and connecting wiring line segment information. Separating and extracting the target character information from such dot information is difficult and has not yet been achieved. As a result, it has not been possible to convert such material into a character code database file and store it as a database of high utility value that can be freely searched, copied, and processed.

[0011] The object of the present invention is therefore to separate and extract character information and line segment information from the dot information of image-format data, convert it into codes, and make it manageable as a text file of high utility value.

[0012]

[Means for Solving the Problems] To solve the above problems, the present invention provides a line segment/character separation recognition unit 13. To identify directional line-element features in the image data 100 by a multidimensional vector, the unit normalizes the data to a size of N dots × M dots, thins the lines, and assigns the result to four types of line elements: vertical, horizontal, upper-right diagonal, and upper-left diagonal. It then divides the result into A × B areas, compares them with a character/symbol dictionary J5 created and stored in advance, and sequentially separates line segments from character symbols and photographic images by feature extraction. This realizes an electronic management device for document materials that receives image-format data consisting of the dot information of figures, images, and the like that make up electronic document materials, separates and extracts character information, line segment information, and image information from the image data 100, and converts them into a coded data format of high utility value.

[0013] In addition to the above configuration, the invention realizes an electronic management device for image document materials that receives the character codes, line segment data, and photographic image data separated by the line segment/character separation recognition unit 13, adds association information for restoring the figure corresponding to the original image data 100, and saves them on a recording medium. The result is an electronic management device of still higher utility value as a database file that supports arbitrary searches by character code and the copying and processing of character strings and text.

[0014]

[Embodiment] FIG. 1 shows the configuration of the electronic management system for document materials according to the present invention. The separation and extraction is outlined first. Line segments, character symbols, and photographic images are separated basically by extracting the features of each. That is, the information read in by the scanner 8 is identified by expressing its directional line-element features as a multidimensional vector. First, one piece of information is normalized to a size of N dots × M dots and its lines are thinned. After thinning, it is assigned to four types of line elements: vertical, horizontal, upper-right diagonal, and upper-left diagonal. The result is then divided into A × B areas and compared with representative vectors of the line segments, character symbols, and photographic images to be recognized, created and stored in advance (for example, the character/symbol dictionary J5). The data is thereby separated first into line segments and character symbols/photographic images, and the latter are then separated into character symbols and photographic images.
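The feature extraction outlined in [0014] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes a binary image held as a list of rows, takes N as the column count and M as the row count, uses nearest-neighbour resampling for the normalization, and assumes the lines have already been thinned (the thinning step is omitted). The function names `normalize` and `direction_features` are hypothetical.

```python
from typing import List

Grid = List[List[int]]  # binary image: 1 = black dot, 0 = background

# Neighbour offsets (d_row, d_col) for the four line-element types.
DIRECTIONS = {
    "vertical": ((-1, 0), (1, 0)),
    "horizontal": ((0, -1), (0, 1)),
    "upper_right": ((-1, 1), (1, -1)),
    "upper_left": ((-1, -1), (1, 1)),
}


def normalize(img: Grid, n: int, m: int) -> Grid:
    """Resample to m rows by n columns (nearest neighbour, an assumption)."""
    rows, cols = len(img), len(img[0])
    return [[img[r * rows // m][c * cols // n] for c in range(n)]
            for r in range(m)]


def direction_features(img: Grid, a: int, b: int) -> List[int]:
    """Return a vector of a * b * 4 counts.

    The image is split into a columns by b rows of areas. For each area
    and each direction, the count is the number of black dots that have
    at least one black neighbour along that direction.
    """
    rows, cols = len(img), len(img[0])
    names = list(DIRECTIONS)
    feats = [0] * (a * b * len(names))
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            zone = (r * b // rows) * a + (c * a // cols)
            for k, name in enumerate(names):
                for dr, dc in DIRECTIONS[name]:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and img[rr][cc]:
                        feats[zone * len(names) + k] += 1
                        break  # count each dot once per direction
    return feats
```

For a single vertical stroke, all counts fall into the "vertical" slots of the vector, which is what lets a later comparison step distinguish stroke orientations area by area.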

[0015] The system consists of an electronic file medium 30, a paper medium 20, a scanner 8, a line segment/character separation recognition unit 13, an editing machine 2, a recording medium 9, and a search machine 11. The present invention adds the line segment/character separation recognition unit 13 to the conventional configuration.

[0016] The line segment/character separation recognition unit 13 receives image data 100 from the electronic file medium 30 or the scanner 8, separates the image data into character symbols, drawings, and photographic images, and extracts character/symbol codes. As shown in FIG. 2, the functional structure of the line segment/character separation recognition unit 13 consists of a first line segment separation unit A1, a second line segment separation unit A2, a third line segment separation unit A3, a first character symbol separation unit B1, a second character symbol separation unit AB1, a photographic image separation unit C1, a fourth character symbol separation unit C2, a fifth character symbol separation unit C3, a photographic image separation unit D1, a character recognition comparison unit C5, a character/symbol dictionary J5, a character storage unit C6, an unidentifiable output unit C7, and a manual identification unit C8.

[0017] The first line segment separation unit A1 receives the image data 100 and separates it into line segments and character symbol/photographic image area data. The second line segment separation unit A2 receives the line segment area data from the first line segment separation unit A1 and separates it into line segments and character symbols/photographic images. The first character symbol separation unit B1 receives the character symbols/photographic images from the first line segment separation unit A1 and separates them into character symbols and photographic images.
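The text says only that these units separate by feature extraction and does not give a concrete criterion. As one illustrative assumption, a first-pass split like that of unit A1 could label connected groups of black dots and treat any group whose bounding box exceeds a size limit as a line segment, leaving the smaller groups as character symbol/photographic image candidates. The size-based rule, the function names, and the `max_symbol_size` parameter are all hypothetical and are not taken from the patent.

```python
from collections import deque
from typing import List, Tuple

Grid = List[List[int]]
Pixels = List[Tuple[int, int]]


def components(img: Grid) -> List[Pixels]:
    """Collect 8-connected groups of black dots using breadth-first search."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    groups: List[Pixels] = []
    for r in range(rows):
        for c in range(cols):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels: Pixels = []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = pr + dr, pc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and img[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            groups.append(pixels)
    return groups


def split_lines_and_symbols(img: Grid, max_symbol_size: int):
    """Assumed rule: a group wider or taller than max_symbol_size is a line."""
    lines, symbols = [], []
    for pixels in components(img):
        height = max(r for r, _ in pixels) - min(r for r, _ in pixels) + 1
        width = max(c for _, c in pixels) - min(c for _, c in pixels) + 1
        target = lines if max(height, width) > max_symbol_size else symbols
        target.append(pixels)
    return lines, symbols
```

A rule this simple misclassifies characters that touch a wiring line, which may be why the described structure passes the line segment area data through a second separation stage (A2) that pulls out remaining character symbols.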

[0018] The photographic image separation unit C1 receives the photographic image data from the first character symbol separation unit B1 and further separates it into character symbols and photographic images. The fifth character symbol separation unit C3 receives the photographic images from the photographic image separation unit C1, extracts the character symbols still remaining in them, and outputs these separately from the photographic images. The photographic image separation unit D1 outputs the photographic image data extracted by the fifth character symbol separation unit C3.

[0019] The second character symbol separation unit AB1 receives the extracted character data from the second line segment separation unit A2 and the first character symbol separation unit B1, combines the two, and stores the result in the character symbol storage unit C4. The extracted character data from the fourth character symbol separation unit C2 and the fifth character symbol separation unit C3 is also stored temporarily in the character symbol storage unit C4.

[0020] The character recognition comparison unit C5 receives character image data from the character symbol storage unit C4 and compares it in turn with the comparison dictionaries for each typeface and each size that make up the character/symbol dictionary J5. It takes the character code with the highest match rate as the character code of that character image data and supplies it to the editing machine 2 via the character storage unit C6. If the match rate is below the desired level, the data is output to the manual identification unit C8 as an unidentifiable character. The manual identification unit C8 displays the ambiguous, undeterminable character shape on a screen for a person to identify, and supplies the identified character code to the editing machine 2 via the character storage unit C6.

[0021] FIG. 3 shows an example of the results obtained by the above means. FIG. 3(a) is the circuit diagram of the image data 100, which is the original drawing. FIG. 3(b) is the result of separating out only the line segments. FIG. 3(c) is the result of recombining the line segments with the character codes extracted separately from them and printing the combined figure.

[0022] The editing machine 2 receives the character codes, line segments (graphic data), and photographic images separated and extracted as described above, and saves them in a form that associates them with one another so that the image can be restored. It also attaches an index for storage and saves the data on the recording medium 9 in association with headings for indexing and retrieval. The extracted character codes can then be used as search keywords, so that a database of information with higher utility value can be realized. Search and similar uses are particularly effective for image data that contains many character codes, such as books and drawings.

[0023] The keywords obtained by the above extraction may be registered on the recording medium 9 as a search index, and may be added to or changed at any later time, to make reference searches of the database easier. The separated character codes, graphic data, photographic images, and the like may also be edited, corrected, or annotated at any time to produce a database of high utility value.

[0024]

[Effects of the Invention] Because the present invention is configured as described above, it has the following effects. The line segment/character separation recognition unit 13 receives an electronic file in image data format, or image data read from a paper medium by the scanner 8, separates the image data into character symbols, drawings, and photographic images, and extracts character/symbol codes. The character codes can then be used in the editing machine 2 as search keywords, so that a database of information with higher utility value can be realized. The invention is particularly effective for separating and extracting image data that contains many character codes, such as circuit diagrams and architectural drawings.

[0025] The search keywords in the index saved on the recording medium 9 can be added to or changed at any later time to make reference searches of the database easier. The separated character codes, graphic data, photographic images, and the like can also be edited, corrected, or annotated at any time to produce a database of high utility value.

[Brief description of drawings]

[FIG. 1] A block diagram of the electronic management system for document materials according to the present invention.

[FIG. 2] A functional structure diagram of the line segment/character separation recognition unit 13 of the present invention.

[FIG. 3] Results obtained with the present invention. (a) is the image data 100, which is the original drawing. (b) shows only the line segments separated from the image data 100. (c) shows the separated line segments recombined with the extracted character codes.

[FIG. 4] A block diagram of a conventional electronic management system for document materials.

[Explanation of symbols]

A1: First line segment separation unit
B1: First character symbol separation unit
AB1: Second character symbol separation unit
C1, D1: Photographic image separation unit
2: Editing machine
A2: Second line segment separation unit
C2: Fourth character symbol separation unit
A3: Third line segment separation unit
C3: Fifth character symbol separation unit
C4: Character symbol storage unit
J5: Character/symbol dictionary
C5: Character recognition comparison unit
C6: Character storage unit
C7: Unidentifiable output unit
8: Scanner
C8: Manual identification unit
9: Recording medium
11: Search machine
13: Line segment/character separation recognition unit
20: Paper medium
30: Electronic file medium
100: Image data

Claims

[Claims]

1. An electronic management system for image document materials that receives data in image data format consisting of the dot information of drawings and the like that make up electronic document materials, and separates and extracts character information, line segment information, and image information from the image data (100), characterized by comprising a line segment/character separation recognition unit (13) that, in order to identify directional line-element features in the image data (100) by a multidimensional vector, normalizes the data to a size of N dots × M dots, thins the lines, assigns the result to four types of line elements (vertical, horizontal, upper-right diagonal, and upper-left diagonal), further divides it into A × B areas, compares it with a character/symbol dictionary (J5) created and stored in advance, and sequentially separates line segments from character symbols and photographic images by feature extraction.

2. The electronic management system for image document materials according to claim 1, which receives the character codes, line segment data, and photographic image data separated by the line segment/character separation recognition unit (13), adds association information for restoring the drawing corresponding to the original image data (100), and saves them on a recording medium.