JP2007011683A

JP2007011683A - Document management support device

Info

Publication number: JP2007011683A
Application number: JP2005191607A
Authority: JP
Inventors: Kei Tanaka; 圭田中; Shoichi Tateno; 昌一舘野; Toshiya Koyama; 俊哉小山; Teruka Saito; 照花斎藤; Masayoshi Sakakibara; 正義榊原; Kotaro Nakamura; 浩太郎中村
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2005-06-30
Filing date: 2005-06-30
Publication date: 2007-01-18

Abstract

PROBLEM TO BE SOLVED: To provide a document management support device that supports management of a document to which handwritten annotations have been added.

SOLUTION: On receiving image data representing a document in which printed characters and handwritten characters are mixed, a control part 21 stores the received image data in a storage part 25 and generates document image data from it. The document image data is divided into image data for a print character area, in which printed characters are written, and image data for a handwritten character area, in which handwritten characters are written. Character recognition processing is carried out on each area to generate text data. The generated text data, an identifier indicating the storage location of the image data containing that text, and an identifier indicating whether the area is a print character area or a handwritten character area are stored in an index table. When a keyword is input through an operation part 24, the index table is searched and the matching image data is displayed on a display part 23.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a technique for supporting the management of documents to which handwritten annotations have been added.

Full-text search of digitized documents is performed using an index table created in advance. A document to be digitized may carry handwritten annotations. Such annotations often mark the important points of the text, call attention to something, or record important matters.

Patent Document 1 discloses a technique in which, when a paper document whose important parts have been marked with a highlighter pen is stored, character recognition processing is applied to the marked portions and the results are registered in an index.
Patent Document 1: JP-A-5-233705

However, in the technique disclosed in Patent Document 1, only the characters in portions marked with a highlighter are registered in the index, so no index is created for characters in unmarked portions. To make a desired character string searchable, the user must mark it with a highlighter each time. Meanwhile, a common way of annotating is to handwrite characters in the margin of a document. With the invention described in Patent Document 1, even after an annotation has been added, it must additionally be marked with a highlighter, which is cumbersome.

The present invention has been made in view of these circumstances. Its object is to provide a document management support apparatus that recognizes both the printed characters contained in a paper document and the handwritten characters added as annotations, creates an index for each kind of character, and can perform searches based on the created index.

To solve the above problem, the present invention provides a document management support apparatus comprising: document image data generation means for scanning a document and acquiring document image data representing the contents of the document; storage means for storing the document image data generated by the document image data generation means; region separation means for cutting out, from the document image data generated by the document image data generation means, image data of a print region in which one or more printed characters are written and image data of a handwritten region in which one or more handwritten characters are written; character recognition processing means for performing character recognition processing on each of the image data of the print region and the image data of the handwritten region and outputting a recognized character string; and index information storage means for storing, in association with one another, a storage location identifier indicating the storage location of each piece of image data stored in the storage means, the recognized character string output by the character recognition processing means, and a character identifier indicating whether the recognized character string represents the image data of a handwritten region or the image data of a print region.

According to this document management support apparatus, the print region in which printed characters are written and the handwritten region in which handwritten characters are written are separated from the document image data, and character recognition is performed on each region to create an index. Annotations handwritten as notes in the margins of a paper document can therefore also be indexed, so an index table covering both printed and handwritten characters can be created. The index table also stores, in association with each printed or handwritten character string, the address at which the document image data containing that string is stored, so the location of a given character string can be identified by referring to the index table.

In a preferred aspect of the present invention, the region separation means may include means for specifying a region of interest in the document image data, and means for obtaining the distance between two adjacent characters in the region of interest for each pair of adjacent characters, determining that the region of interest is a print region when the degree of variation among the obtained distances does not exceed a predetermined range, and determining that the region of interest is a handwritten region when the degree of variation exceeds the predetermined range.

In general, the spacing between adjacent printed characters is constant, whereas the spacing between adjacent handwritten characters is not. Therefore, by judging against a predetermined value whether the character spacing of the character string in a region extracted by the region separation means is constant, it can be determined whether the region is image data of a printed character string or of a handwritten character string.

The apparatus may further comprise search means for searching the index information storage means by matching a character string input as a search key against the recognized character strings stored in the index information storage means, and display means for displaying the result of the search. This makes it possible to search the index table based on the search key string and display the search result.

The apparatus may further comprise designation means for designating whether the search key string is to be searched for in the handwritten region or in the print region, and the search means may perform the search on the region designated by the designation means.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a block diagram showing a configuration example of a document management support system 1, which is one embodiment of the document management support apparatus according to the present invention. The image reading apparatus 10 in FIG. 1 is a scanner equipped with an automatic paper feed mechanism such as an ADF (Auto Document Feeder). It optically reads a paper document set in the ADF one page at a time and transfers image data corresponding to the read image to the document management support apparatus 20 via the communication line 12. A LAN (Local Area Network), a WAN (Wide Area Network), the Internet, or the like can be used as the communication line 12; in this embodiment, a LAN is used.

FIG. 2 is a block diagram showing the hardware configuration of the document management support apparatus 20.
The control unit 21 is, for example, a CPU (Central Processing Unit), and controls each unit of the document management support apparatus 20 by executing various software. The communication interface (hereinafter "IF") unit 22 is connected to the image reading apparatus 10 via the communication line 12, receives the image data transmitted from the image reading apparatus 10 over the communication line 12, and transfers it to the control unit 21.

The display unit 23 is, for example, a liquid crystal display and its drive circuit, and displays an image corresponding to the data transferred from the control unit 21. The operation unit 24 is, for example, a keyboard, a mouse, and the like provided with a plurality of operators (not shown), and outputs data corresponding to the operations performed on those operators (hereinafter, operation content data) to the control unit 21.

The storage unit 25 includes a volatile storage unit 25a and a nonvolatile storage unit 25b. The volatile storage unit 25a is, for example, a RAM (Random Access Memory), and is used as a work area by the control unit 21. The nonvolatile storage unit 25b is, for example, a hard disk, and stores programs such as an analysis processing program P1 as well as an index table T1.

The following describes how the control unit 21 performs image data analysis processing on the documents 30 to 32 shown in FIGS. 3(a) to 3(c).
When the power supply (not shown) of the document management support apparatus 20 is turned on, the control unit 21 reads the analysis processing program P1 from the nonvolatile storage unit 25b and executes the operation of the flowchart shown in FIG. 4.
First, when the user sets the documents 30 to 32 in the ADF of the image reading apparatus 10 and performs a predetermined operation, the image reading apparatus 10 reads the images of the documents 30 to 32 in sequence, and the image data corresponding to each document is sent in sequence from the image reading apparatus 10 to the document management support apparatus 20 via the communication line 12.

When the control unit 21 receives the image data sent from the image reading apparatus 10 via the communication IF unit 22 (step SA10), it stores the received image data in the storage unit 25 (step SA12). The control unit 21 then generates document image data from the image data of each of the documents 30 to 32 (step SA14). Next, the control unit 21 cuts out, from the document image data, the image data of the print regions in which printed characters are written and the image data of the handwritten regions in which handwritten characters are written (step SA16).

The print regions and handwritten regions are cut out as follows. First, the pixels represented by the document image data are scanned in the horizontal direction, and whenever the distance between two adjacent characters, that is, the width of a run of consecutive white pixels, is smaller than a predetermined value X, those white pixels are replaced with black pixels. The predetermined value X is set to roughly the value expected for the distance between adjacent characters. Similarly, the pixels are scanned in the vertical direction, and whenever the width of a run of consecutive white pixels is smaller than a predetermined value Y, those white pixels are replaced with black pixels. The predetermined value Y is set to roughly the value expected for the spacing between lines of text. As a result, regions filled with black pixels are formed. FIG. 5 shows the image of the document 30 after this replacement processing; regions L1 to L6 filled with black pixels have been formed. The same operation is performed on the images of the documents 31 and 32, so the following description takes the image of the document 30 as its subject.
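The replacement processing described above resembles what is commonly called run-length smoothing. The following is a minimal sketch of it, not the patent's implementation. It assumes a binary image held as nested lists (1 = black, 0 = white), applies the vertical pass to the result of the horizontal pass, and fills only white runs bounded by black pixels on both sides; the function names and these choices are illustrative assumptions.

```python
from typing import List

Image = List[List[int]]  # assumed encoding: 1 = black pixel, 0 = white pixel


def fill_short_white_runs(line: List[int], limit: int) -> List[int]:
    """Return a copy of `line` in which every white run shorter than `limit`
    that lies between two black pixels is turned black."""
    out = list(line)
    last_black = None  # index of the most recent black pixel in the input
    for i, px in enumerate(line):
        if px == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap < limit:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def smear(image: Image, x_limit: int, y_limit: int) -> Image:
    """Horizontal pass with threshold X, then vertical pass with threshold Y."""
    rows = [fill_short_white_runs(row, x_limit) for row in image]
    # Transpose, process each column as a line, then transpose back.
    cols = [fill_short_white_runs(list(col), y_limit) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


page = [
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
]
blocks = smear(page, x_limit=3, y_limit=2)
```

In this toy input the two-pixel gap in each row is closed while the four-pixel gap is kept, and the one-pixel blank row is closed wherever black pixels sit above and below it, so the left-hand characters merge into one solid block.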

Once the regions filled with black pixels have been formed, each region is judged to be either a print region or a handwritten region. In this judgment, a region of interest to be processed is first specified, the black pixels that were substituted within it are returned to white pixels, and the original drawn content is restored. The pixels in the region are then scanned in the horizontal direction, and it is determined whether the degree of variation in the pitch of the runs of consecutive white pixels is smaller than a predetermined value. In general, in a region containing printed characters the spacing between adjacent characters is roughly constant, so the variation in pitch is smaller than the predetermined value. In a region containing handwritten characters the spacing between adjacent characters is not constant, so the variation is larger than the predetermined value. In the example of FIG. 5, the regions L1 to L5 are judged to be print regions and the region L6 is judged to be a handwritten region.
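The judgment above can be sketched as follows. The text does not say how the "degree of variation" is measured, so this sketch assumes the population standard deviation of the white-gap widths. It also assumes the region has already been reduced to a per-column flag saying whether that column contains any black pixel. The names `white_gaps`, `classify_region`, and `max_spread` are hypothetical.

```python
from statistics import pstdev
from typing import List


def white_gaps(column_has_ink: List[bool]) -> List[int]:
    """Widths of the white runs that lie between two inked columns."""
    gaps: List[int] = []
    run = 0
    seen_ink = False
    for ink in column_has_ink:
        if ink:
            if seen_ink and run > 0:
                gaps.append(run)
            run = 0
            seen_ink = True
        else:
            run += 1
    return gaps


def classify_region(column_has_ink: List[bool], max_spread: float) -> str:
    """Return "print" when the gap widths vary less than `max_spread`,
    otherwise "handwritten"."""
    gaps = white_gaps(column_has_ink)
    # Assumption: a region with no measurable gap is treated as zero spread.
    spread = pstdev(gaps) if gaps else 0.0
    return "print" if spread < max_spread else "handwritten"


def profile(pattern: str) -> List[bool]:
    """Helper for the example: '#' marks a column containing ink."""
    return [ch == "#" for ch in pattern]


even = classify_region(profile("##..##..##..##"), max_spread=1.0)
uneven = classify_region(profile("#.##.....#..#"), max_spread=1.0)
```

Here `even` has gaps of 2, 2, 2 and `uneven` has gaps of 1, 5, 2. A real implementation would also need to separate gaps inside a character from gaps between characters, which this sketch ignores.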

Then, OCR processing is applied to each region to perform character recognition, generating printed-character text data from the print regions and handwritten-character text data from the handwritten regions (step SA18). Next, morphological analysis is applied to the generated text data, the text data corresponding to nouns is extracted from each (step SA20), and the extracted data is stored in the index table T1 (step SA22).

The index table T1 will now be described with reference to FIG. 6. As shown in FIG. 6, the index table T1 consists of "character string", "image data address", and "flag" fields. The character string field stores a noun extracted in step SA20. The image data address field stores, as an image data address, an identifier indicating where the image data of the document 30 is stored, that is, the storage location of that image data in the nonvolatile storage unit 25b. The flag field stores an identifier indicating whether the extracted text data is printed-character text data or handwritten-character text data. In this embodiment, the flag field stores "1" for printed-character text data and "0" for handwritten-character text data.

As a result, text data such as "あいうえお" (aiueo) and "かきくけこ" (kakikukeko) is stored in the character string field. In association with this text data, the image data address "01" of the image data of the document 30 is stored in the image data address field.
Furthermore, the region containing the text data "あいうえお" and "かきくけこ" was judged in step SA16 to be a print region (L1), so "1", indicating printed-character text data, is stored in the flag field in association with each of these items. The text data "いろは" (iroha) is handled in the same way: the text data "いろは" and the identifier "01" indicating the storage location of the image of the document 30 are stored in association with each other. This text data is contained in the region L6, which was judged in step SA16 to be a handwritten region, so "0" is stored in the flag field in association with it.
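The three-field record layout and the example entries above can be sketched as a small in-memory table. This is an illustration only: the patent stores T1 on the nonvolatile storage unit, and the names `IndexRecord` and `add_region` are assumptions introduced here. Noun extraction is assumed to have happened already, so the function receives the nouns directly.

```python
from dataclasses import dataclass
from typing import List

PRINT, HANDWRITTEN = 1, 0  # flag values used in the embodiment


@dataclass(frozen=True)
class IndexRecord:
    text: str           # "character string" field (a noun from step SA20)
    image_address: str  # "image data address" field
    flag: int           # "flag" field: 1 = printed, 0 = handwritten


def add_region(index: List[IndexRecord], nouns: List[str],
               image_address: str, is_print: bool) -> None:
    """Append one record per noun found in a single region (step SA22)."""
    flag = PRINT if is_print else HANDWRITTEN
    for noun in nouns:
        index.append(IndexRecord(noun, image_address, flag))


index: List[IndexRecord] = []
# Region L1 of document 30 was judged to be a print region.
add_region(index, ["あいうえお", "かきくけこ"], "01", is_print=True)
# Region L6 of document 30 was judged to be a handwritten region.
add_region(index, ["いろは"], "01", is_print=False)
```

Storing the flag on every record, rather than once per region, keeps each row self-describing, which is what lets a later search count or filter printed and handwritten hits without going back to the image.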

Next, the search operation in which the control unit 21 searches the index table T1 based on a keyword will be described.
FIG. 7 is a flowchart showing the search operation.
First, the control unit 21 sets up three variables, the matching printed string count x, the matching handwritten string count y, and the matching image data count z, and assigns an initial value to each (step SB10).

Next, the keyword input screen 40 shown in FIG. 8 is displayed on the display unit 23 and the user is prompted to enter a keyword (step SB12). As shown in FIG. 8, the keyword input screen 40 consists of an input box 41 and an "OK" button BT1. A user viewing the keyword input screen 40 can specify a keyword by operating the operation unit 24 and pressing the OK button BT1. Suppose the character string entered by the user is "あいうえお" (aiueo). The operation unit 24 supplies character data representing "あいうえお" to the control unit 21.

On receiving the character data, the control unit 21 searches the character string field of the index table T1 for the content of the character data, that is, "あいうえお" (aiueo) (step SB14). In this example, three records in the index table T1 have text data identical to "あいうえお" in their character string field.
The control unit 21 reads these records in sequence into the volatile storage unit 25a (step SB16). FIG. 9 shows an example of the records read out.

Next, the control unit 21 assigns the number of extracted records whose flag field holds "1" to the matching printed string count x, and the number whose flag field holds "0" to the matching handwritten string count y (step SB18). Among the extracted records, two have "1" in the flag field and one has "0", so the control unit 21 sets x = 2 and y = 1.

Next, the control unit 21 detects how many distinct addresses are stored in the address field of the extracted records and assigns that number to the matching image data count z (step SB20). Here there are two distinct addresses, "01" and "03", so the control unit 21 sets z = 2.

Next, the control unit 21 reads out image data in ascending order of the addresses in the address field of the extracted records (step SB22). The lowest address is "01", so the control unit 21 first reads the image data stored at address 01. The image data stored at address 01 is that of the document 30, so the control unit 21 reads out the image data of the document 30.

The control unit 21 then displays on the display unit 23 a search result screen 40 generated from the image data of the document 30 and the counts x, y, and z described above (step SB24).
FIG. 10 shows the search result screen 40. As shown in the figure, the search result screen 40 consists of a display area 42, a search result field, and a "next image" button BT2. The display area 42 shows the image of the document 30. The search result field shows the matching printed string count x, the matching handwritten string count y, and the matching image data count z.

When a user viewing the search result screen 40 presses the next image button BT2, the control unit 21 reads the image data stored at the second lowest address in the address field of the records extracted in step SB16 and displays the image of that image data in the display area 42. The image stored at the second lowest address is that of the document 32, so the image of the document 32 is displayed in the display area 42. In this way, the control unit 21 displays the matching images in the display area 42 one after another.
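Steps SB14 to SB22 can be sketched as one function that returns the counts x and y together with the distinct addresses in ascending order, from which z and the display order follow. The sample rows below are assumptions: FIG. 9 is not reproduced in this text, so how the flags are distributed between addresses "01" and "03" is a guess chosen only to match the stated totals (x = 2, y = 1, z = 2). `Record` and `search` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    text: str     # "character string" field
    address: str  # "image data address" field
    flag: int     # 1 = printed, 0 = handwritten


def search(index: List[Record], keyword: str) -> Tuple[int, int, List[str]]:
    """Exact-match lookup returning (x, y, addresses in ascending order)."""
    hits = [r for r in index if r.text == keyword]      # steps SB14 and SB16
    x = sum(1 for r in hits if r.flag == 1)             # step SB18
    y = sum(1 for r in hits if r.flag == 0)
    # Zero-padded address strings are assumed, so string order is numeric order.
    addresses = sorted({r.address for r in hits})       # steps SB20 and SB22
    return x, y, addresses


index = [
    Record("あいうえお", "01", 1),
    Record("かきくけこ", "01", 1),
    Record("いろは", "01", 0),
    Record("あいうえお", "03", 1),  # assumed row
    Record("あいうえお", "03", 0),  # assumed row
]
x, y, addresses = search(index, "あいうえお")
z = len(addresses)
```

Pressing the next image button then corresponds to stepping through `addresses` one element at a time.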

As described above, in this embodiment character recognition is also performed on annotations handwritten on a document and an index table is created from them, so searches can cover not only the printed characters in the document but also the handwritten characters. Therefore, when the handwritten characters contain important information or a comment intended for others, that information can also be presented to the user as a search result.

[Modifications]
The present invention can be implemented in various forms other than the embodiment described above.
(1) Because handwriting has quirks that differ from person to person, a handwritten character may be recognized as a different character from the one intended. Therefore, a similar character dictionary table that associates characters prone to misrecognition, such as "ツ" (tsu) and "シ" (shi) or "ソ" (so) and "ン" (n), may be stored in the nonvolatile storage unit 25b, and when text data is stored in the index table T1, text data created from the similar character dictionary table may be stored as well. Specifically, when the character string "ペーヅ" (pēzu) is obtained as the result of character recognition in step SA18 (see FIG. 4), text data representing "ペーヅ" is stored in the index table T1 in step SA22, together with text data representing "ページ" (pēji, "page") obtained by conversion using the similar character dictionary table. FIG. 11 shows an example of the index table T1 in this case. In this way, even if text data representing a character string different from the intended one is stored in the index table T1 because of the writer's handwriting quirks, the text data converted with the similar character dictionary table is stored as well. This prevents image data that should be found by a search from being missed because of those quirks.
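Modification (1) can be sketched as generating alternative spellings from a table of confusable characters and indexing all of them. The table contents below are assumptions: the text names the pairs ツ/シ and ソ/ン, and the ヅ/ジ pair is inferred from the "ペーヅ" example. The text describes storing a converted string but not how multiple confusable characters in one word are combined; this sketch simply enumerates every combination.

```python
from itertools import product
from typing import Dict, List

# Hypothetical similar character dictionary table (symmetric pairs).
SIMILAR: Dict[str, str] = {
    "ツ": "シ", "シ": "ツ",
    "ソ": "ン", "ン": "ソ",
    "ヅ": "ジ", "ジ": "ヅ",
}


def variants(text: str) -> List[str]:
    """Return `text` followed by every spelling obtained by swapping
    confusable characters for their partners."""
    options = [(ch, SIMILAR[ch]) if ch in SIMILAR else (ch,) for ch in text]
    return ["".join(chars) for chars in product(*options)]


to_index = variants("ペーヅ")
```

Each string in `to_index` would be stored as its own record with the same image data address and flag. The number of variants doubles with each confusable character, so a real system would likely cap or filter them, for example against a word list.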

(2) A table associating synonyms (a synonym table) may be stored in the nonvolatile storage unit 25b, and the index table T1 may be searched with keywords expanded using this synonym table. For example, if "白黒" (shirokuro, "black and white") and "モノクロ" (monokuro, "monochrome") are stored as synonyms in the synonym table and the user enters "白黒" on the keyword input screen 40 (see FIG. 8), the control unit 21 searches the index table T1 using both "白黒" and "モノクロ" as keywords. In this way, image data can be searched for under expanded conditions.
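The keyword expansion in modification (2) can be sketched as below. The synonym table is modeled as a list of sets and the index as (text, address) pairs; both representations, and the names `expand_keyword` and `search`, are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical synonym table: each set is one group of interchangeable terms.
SYNONYM_GROUPS: List[Set[str]] = [{"白黒", "モノクロ"}]


def expand_keyword(keyword: str) -> Set[str]:
    """Return the keyword plus every synonym registered for it."""
    terms = {keyword}
    for group in SYNONYM_GROUPS:
        if keyword in group:
            terms |= group
    return terms


def search(index: List[Tuple[str, str]], keyword: str) -> List[str]:
    """Addresses of images whose indexed text matches any expanded term."""
    terms = expand_keyword(keyword)
    return sorted({address for text, address in index if text in terms})


index = [("白黒", "01"), ("モノクロ", "02"), ("カラー", "03")]
found = search(index, "白黒")
```

Expanding at query time, as here, leaves the index unchanged, in contrast to modification (1), which adds extra records at indexing time.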

(3) In this embodiment, the control unit 21 searches using only character string data. However, the search may also take into account a designation of whether to search the printed-character text data or the handwritten-character text data. As one example, in step SB12 the control unit 21 displays the keyword input screen 50 shown in FIG. 12. As shown in the figure, the keyword input screen 50 is the keyword input screen 40 (see FIG. 8) with check boxes added, namely a printed character check box 51 and a handwritten character check box 52. The control unit 21 searches according to the state of these check boxes. For example, if the user enters "あいうえお" (aiueo) in the input box 41 of the keyword input screen 50 and checks the printed character check box 51, the control unit 21 searches for records whose character string field holds the text data "あいうえお" and whose flag field holds "1".
In this way, a search can be performed under conditions that better match the user's purpose.
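The check-box search in modification (3) amounts to adding a flag condition to the text match. A minimal sketch follows; the parameter names are hypothetical, and returning no results when neither box is checked is an assumption, since the text does not describe that case.

```python
from typing import List, NamedTuple, Set


class Record(NamedTuple):
    text: str     # "character string" field
    address: str  # "image data address" field
    flag: int     # 1 = printed, 0 = handwritten


def search_by_kind(index: List[Record], keyword: str,
                   printed_checked: bool,
                   handwritten_checked: bool) -> List[Record]:
    """Return records matching `keyword` whose flag matches a checked box."""
    allowed: Set[int] = set()
    if printed_checked:
        allowed.add(1)
    if handwritten_checked:
        allowed.add(0)
    return [r for r in index if r.text == keyword and r.flag in allowed]


index = [
    Record("あいうえお", "01", 1),
    Record("あいうえお", "03", 0),
    Record("いろは", "01", 0),
]
printed_only = search_by_kind(index, "あいうえお", True, False)
```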

(4) In the present embodiment, printed characters and handwritten characters are stored in the same table (the index table T1); however, they may instead be stored in separate tables.

(5) In the present embodiment, the image reading device 10 and the document management support device 20 are configured as separate pieces of hardware; however, the two may be configured as a single piece of hardware. In that case, the communication line 12 is an internal bus that connects the image reading device 10 and the document management support device 20 within that hardware.

FIG. 1 is a block diagram showing the overall configuration of a system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the document management support device according to the embodiment.
FIG. 3 is a diagram showing the layout format of a document according to the embodiment.
FIG. 4 is a flowchart showing an operation performed by the control unit according to the embodiment.
FIG. 5 is a diagram showing a document according to the embodiment after replacement with black pixels.
FIG. 6 is a diagram showing the format of the index table according to the embodiment.
FIG. 7 is a flowchart showing an operation performed by the control unit according to the embodiment.
FIG. 8 is a diagram showing the keyword input screen according to the embodiment.
FIG. 9 is a diagram showing a group of records read by the control unit according to the embodiment.
FIG. 10 is a diagram showing the search result screen according to the embodiment.
FIG. 11 is a diagram showing the format of an index table according to a modification of the present invention.
FIG. 12 is a diagram showing a keyword input screen according to a modification of the present invention.

Explanation of symbols

1 ... Document management support system, 10 ... Image reading device, 12 ... Communication line, 20 ... Document management support device, 21 ... Control unit, 22 ... Communication IF unit, 23 ... Display unit, 24 ... Operation unit, 25 ... Storage unit, 25a ... Volatile storage unit, 25b ... Non-volatile storage unit, 26 ... Bus

Claims

A document management support apparatus comprising:
document image data generation means for scanning a document and generating document image data representing the content of the document;
storage means for storing the document image data generated by the document image data generation means;
area separation means for cutting out, from the document image data generated by the document image data generation means, image data of a printed character area in which one or more printed characters are written and image data of a handwritten character area in which one or more handwritten characters are written;
character recognition processing means for performing character recognition processing on each of the image data of the printed character area and the image data of the handwritten character area, and outputting a recognized character string; and
index information storage means for storing, in association with one another, a storage location identifier indicating the storage location of each piece of image data stored in the storage means, the recognized character string output from the character recognition processing means, and a character identifier indicating whether the recognized character string represents image data of a handwritten character area or image data of a printed character area.

The document management support apparatus according to claim 1, wherein the area separation means comprises:
means for specifying an area of interest from the document image data; and
means for obtaining the distance between two mutually adjacent characters in the area of interest for each pair of adjacent characters, determining that the area of interest is a printed character area when the degree of variation in the obtained distances does not exceed a predetermined range, and determining that the area of interest is a handwritten character area when the degree of variation exceeds the predetermined range.

The document management support apparatus according to claim 1, further comprising:
search means for searching the index information by collating a character string input as a search key with the recognized character strings stored in the index information storage means; and
display means for displaying a search result of the search means.

The document management support apparatus according to claim 3, further comprising designation means for designating which of the handwritten character area and the printed character area is to be searched for the character string serving as the search key,
wherein the search means performs the search on the area designated by the designation means.
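
The determination recited in claim 2 (classifying an area of interest by how much the spacing between adjacent characters varies) can be illustrated with the following sketch. It is not the claimed implementation: measuring distance as the horizontal gap between bounding boxes, using the population standard deviation as the "degree of variation", and the threshold value are all assumptions introduced for illustration.

```python
from statistics import pstdev
from typing import List, Tuple

# Hypothetical representation: (left, right) x-coordinates of one character box.
Box = Tuple[float, float]


def classify_area(boxes: List[Box], max_variation: float) -> str:
    """Classify an area of interest as 'printed' or 'handwritten'.

    The gap between each pair of adjacent characters is measured, and the
    spread of those gaps is compared with a predetermined threshold.
    """
    ordered = sorted(boxes)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("at least two characters are needed to measure spacing")
    variation = pstdev(gaps)  # assumption: standard deviation as the measure
    return "printed" if variation <= max_variation else "handwritten"


# Evenly spaced characters (gaps of 2, 2, 2) versus uneven ones (gaps of 1, 8, 1).
even = [(0, 10), (12, 22), (24, 34), (36, 46)]
uneven = [(0, 10), (11, 20), (28, 35), (36, 50)]

even_label = classify_area(even, max_variation=1.0)
uneven_label = classify_area(uneven, max_variation=1.0)
```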