JP2005208979A

JP2005208979A - Featured value extracting device and method and document filing device

Info

Publication number: JP2005208979A
Application number: JP2004015509A
Authority: JP
Inventors: Hitoshi Okamoto; 仁岡本; Kagenori Nagao; 景則長尾; Shinichi Yada; 伸一矢田
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2004-01-23
Filing date: 2004-01-23
Publication date: 2005-08-04

Abstract

PROBLEM TO BE SOLVED: When pixel values are accumulated as they are in the projection direction, the base color of the paper, the background color of the original, and image and ruled-line components inserted in the document are accumulated as well, so the character components in the projection waveform become indistinct and feature values cannot be extracted correctly.

SOLUTION: A projection waveform generating part 21 generates a projection waveform from input multi-value document data. Specifically, the pixel values of each document data are accumulated in the horizontal and vertical directions to generate projection waveform data. A binarizing part 22 binarizes the projection waveform data to obtain a binary data sequence consisting only of black regions and white regions. A frequency distribution analyzing part 23 then calculates the lengths of all continuous black regions in the binary data sequence, and those lengths whose appearance frequency is higher than a predetermined threshold value are set as the feature values of the document data.

COPYRIGHT: (C) 2005, JPO & NCIPI

Description

The present invention relates to a feature amount extraction device, a feature amount extraction method, and a document filing device. More particularly, it relates to a feature amount extraction device and method that use projection waveforms of a document image to extract feature amounts such as document layout, character size, and line spacing, and to a document filing device that uses the feature amount extraction device to determine the break positions between documents and manages document data document by document.

In recent years, when multiple documents, each consisting of one or more paper originals, are to be digitized efficiently, it has become common to read the originals continuously with a scanner device that has an automatic paper feed function. To manage the read original image data (document data) document by document, the breaks between one document and the next must be detected by some method.

To detect the breaks between documents, one conventional approach cuts out a preset character recognition area from the image data read by the scanner device, performs character recognition on it, and determines the document breaks from the recognition result (see, for example, Patent Document 1). Another approach reads a plurality of documents in one batch with a scanner device having an automatic paper feed function, calculates feature amounts of the read original images, and determines the document breaks from the calculated feature amounts (see, for example, Patent Document 2).

Both of these conventional techniques determine document breaks automatically, without inserting a separator sheet (for example, a blank sheet) between documents in advance and without modifying the originals that mark the breaks, so the burden on the user is greatly reduced. In particular, the technique of Patent Document 2 places no restriction on the format of the target originals and can handle a wider range of documents, whereas the technique of Patent Document 1 can handle only originals of a specific format.

The technique of Patent Document 2 cites, as examples of feature amounts on which the break determination is based, the hues used in the image data, the screen ruling, the layout of the original, and the text direction. In ordinary office documents, however, many originals contain only text and no images, so feature amounts such as image hue and screen ruling often cannot be used to determine break positions. Moreover, because horizontal writing is the norm, the text direction is often of no help in determining break positions either.

On the other hand, feature amounts such as document layout, character size, and line spacing generally differ from one office document to another, so using them to determine document break positions is an effective approach. The usual way to extract such feature amounts from a document image is to form projection waveforms of the image in the vertical and horizontal directions and to use those waveforms.

Specifically, in one technique the document image is binarized and projected in the vertical and horizontal directions, the projection waveforms are thresholded to distinguish character regions from blank regions, and the character size, line spacing, and the like are detected from the result (see, for example, Patent Document 3). In another technique the document image is binarized, lines are cut out by projecting in the horizontal direction, each character is separated by projecting the image of each line in the vertical direction, and the character separation is then corrected by projecting each character in the horizontal direction again, from which the character size is detected (see, for example, Patent Document 4).

To extract the character size, one technique first binarizes the document image, cuts out lines by projecting in the horizontal direction, separates each character by projecting the image of each line in the vertical direction, and then corrects the character separation by projecting each character in the horizontal direction again (see, for example, Patent Document 5). Another technique extracts information on the blocks circumscribing each character or character string, calculates a standard character block size, compares it with the circumscribing block information to detect blocks of connected characters, calculates the projection distribution of one line of text and its frequency distribution, and derives from the frequency distribution the optimum threshold for splitting a connected-character block into circumscribing blocks of one character each, thereby improving the accuracy of character size detection for handwritten text (see, for example, Patent Document 6).

Patent Document 1: Japanese Patent Laid-Open No. 10-21380
Patent Document 2: Japanese Patent Laid-Open No. 2002-24258
Patent Document 3: Japanese Examined Patent Publication No. 7-111738
Patent Document 4: Japanese Patent Laid-Open No. 5-89283
Patent Document 5: Japanese Patent Laid-Open No. 5-89283
Patent Document 6: Japanese Patent Laid-Open No. 7-98747

As described above, in each of the conventional techniques of Patent Documents 3 to 6 that use projection waveforms, the read document image is first binarized, a projection waveform is obtained from the binarized image, and feature amounts are extracted from that waveform, as shown in FIG. 10. The result therefore depends strongly on the binarization threshold. Even when an appropriate threshold is chosen, however, that value depends heavily on the base color of the paper, the background color of the original, the presence or absence of images inserted in the document, and the color and density of the characters, so it is difficult to obtain stable results.

As shown in FIG. 11, taking the projection of the multi-valued image without binarizing it removes the binarization-level problem described above. However, when the base color of the paper or the background color of the original is dark, when images are inserted in the document, when the character density is low, or when the lines are short, the difference between the projection of the character portions and that of the background becomes indistinct. Furthermore, for a document image containing much low-density noise, binarization classifies the noise as white and thus eliminates its influence, whereas a multi-valued projection accumulates the noise components as well, which also makes processing difficult.

In other words, in conventional techniques that use projection waveforms to detect feature amounts such as character size and line spacing in a text image, accumulating the pixel values as they are in the projection direction also accumulates the base color of the paper, the background color of the original, and image and ruled-line components inserted in the document, so the character components in the projection waveform become indistinct. When it is unclear which part of the projection waveform corresponds to characters, feature amounts such as character size and line spacing cannot be detected correctly from the waveform. Moreover, a character size extracted from a document is not necessarily a feature amount that represents that document.

The present invention has been made in view of the above problems. Its object is to provide a feature amount extraction device and a feature amount extraction method that can correctly extract feature amounts such as character size and line spacing without being affected by disturbances such as the base color of the paper, the background color of the original, and image and ruled-line components inserted in the document, and to provide a document filing device that uses the feature amount extraction device.

To achieve the above object, the present invention generates projection waveform data from input multi-valued document data and binarizes the generated projection waveform data. It then obtains values indicating the characteristics of all continuous black regions, all continuous white regions, or both in the binarization result, analyzes the frequency distribution of those values, and uses the analysis result as the feature amount of the document data. It may further determine document breaks based on the feature amounts obtained in this way.

When projection waveforms are used to extract feature amounts such as character size and line spacing in a text image, generating the projection waveform data directly from the multi-valued document data and then binarizing it yields a binary data series consisting only of black regions and white regions. By obtaining values indicating the characteristics of all continuous black regions, all continuous white regions, or both in this binary data series, analyzing their frequency distribution, and using the analysis result as the feature amount of the document data, effective feature amounts can be selected from among a plurality of feature amounts contaminated with noise.

According to the present invention, effective feature amounts can be selected from among a plurality of feature amounts contaminated with noise, so document breaks can be determined correctly based on these feature amounts. In addition, because the break determination can be automated, the burden that manual separation places on the user is reduced and the separation work becomes more efficient.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing a configuration example of a document filing device according to an embodiment of the present invention. As FIG. 1 shows, the document filing device of this embodiment includes a document input unit 11, a feature amount extraction unit 12, a similarity evaluation unit 13, a document storage unit 14, a document delimiter unit 15, and a document output unit 16, which are connected to one another via a bus line 17. In this configuration, the feature amount extraction unit 12 is the most characteristic part of the present invention.

The document input unit 11 acquires document data from an input document and inputs it as document data to be registered in the document filing device. Examples of the input document data include multi-valued image data obtained by scanning printed matter and multi-valued image data captured with a digital camera. Accordingly, the document input unit 11 may be, for example, a scanner device equipped with an ADF (Auto Document Feeder) and its control means, or a memory reader device that successively retrieves images stored in the memory (card) of a digital camera and its control means.

The feature amount extraction unit 12 extracts, from the multi-valued document data (multi-valued image data) input by the document input unit 11, quantities indicating characteristics unique to that document data (hereinafter referred to as "feature amounts"). As shown in FIG. 2, the feature amount extraction unit 12 of this example forms projection waveforms by accumulating the pixel values of each document data in the horizontal and vertical directions, and extracts from these waveforms, for example, n values of the height h and width w of the characters that represent the document data. Here, the projection waveform data is the sum or the average of the pixel values of each document data in the vertical and horizontal directions. For example, the projection value is "255" when every pixel along the vertical or horizontal direction is black, and "0" when every pixel is white.
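As an illustration only, the projection described above can be sketched in Python. The function name `projection_waveforms` and the input layout (a list of rows of darkness values, where 255 is black and 0 is white, matching the example values in the text) are assumptions made for this sketch and do not come from the specification. The sketch uses the averaging variant, so an all-black row projects to 255.

```python
from typing import List, Sequence, Tuple


def projection_waveforms(
    image: Sequence[Sequence[int]],
) -> Tuple[List[float], List[float]]:
    """Return (row_profile, column_profile) of mean darkness values.

    Assumption: image[y][x] is a darkness value, 0 = white, 255 = black.
    """
    height = len(image)
    width = len(image[0])
    # Projecting along the horizontal direction gives one value per row.
    row_profile = [sum(row) / width for row in image]
    # Projecting along the vertical direction gives one value per column.
    column_profile = [
        sum(image[y][x] for y in range(height)) / height for x in range(width)
    ]
    return row_profile, column_profile


# Hypothetical 3 x 4 page: a white row, a black row, and a striped row.
page = [
    [0, 0, 0, 0],
    [255, 255, 255, 255],
    [0, 255, 0, 255],
]
rows, cols = projection_waveforms(page)
print(rows)  # [0.0, 255.0, 127.5]
print(cols)  # [85.0, 170.0, 85.0, 170.0]
```

Averaging rather than summing keeps the waveform in the 0 to 255 range regardless of page size, which is one plausible reason the text mentions both options.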

FIG. 3 is a block diagram showing an example of a specific configuration of the feature amount extraction unit 12. As FIG. 3 shows, the feature amount extraction unit 12 of this example includes a projection waveform generation unit 21, a binarization unit 22, and a frequency distribution analysis unit 23.

The projection waveform generation unit 21 forms projection waveforms directly from the input multi-valued document data. Specifically, as described above, it generates projection waveform data by accumulating (or averaging) the pixel values of each document data in the horizontal and vertical directions. The binarization unit 22 binarizes the projection waveform data generated by the projection waveform generation unit 21 and outputs a binary data series consisting only of black regions and white regions. The frequency distribution analysis unit 23 obtains the lengths of all continuous black regions in the binary data series output from the binarization unit 22 and, as shown for example in FIG. 4, takes the n/2 lengths whose appearance frequency is higher than a predetermined threshold, together with their appearance frequencies, as the feature amount of the document data.
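The binarization and frequency analysis steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the fixed binarization threshold of 128, and the parameters `min_count` and `limit` are all hypothetical, since the text does not say how the thresholds are chosen.

```python
from collections import Counter
from itertools import groupby
from typing import List, Sequence, Tuple


def binarize(waveform: Sequence[float], threshold: float) -> List[int]:
    """Map each projection value to 1 (black) or 0 (white)."""
    return [1 if value >= threshold else 0 for value in waveform]


def run_lengths(bits: Sequence[int], value: int = 1) -> List[int]:
    """Lengths of every maximal run of `value` (1 = black, 0 = white)."""
    return [len(list(group)) for key, group in groupby(bits) if key == value]


def frequent_lengths(
    lengths: Sequence[int], min_count: int, limit: int
) -> List[Tuple[int, int]]:
    """Keep up to `limit` (length, count) pairs whose count exceeds `min_count`."""
    counts = Counter(lengths)
    frequent = [
        (length, count)
        for length, count in counts.most_common()
        if count > min_count
    ]
    return frequent[:limit]


# Hypothetical row projection: text lines show up as dark bumps.
waveform = [0, 200, 210, 0, 190, 205, 0, 0, 220, 0, 180, 200, 0]
bits = binarize(waveform, threshold=128)        # assumed threshold
black = run_lengths(bits, value=1)              # candidate character heights
white = run_lengths(bits, value=0)              # candidate line gaps
features = frequent_lengths(black, min_count=1, limit=2)
print(black)     # [2, 2, 1, 2]
print(features)  # [(2, 3)]
```

In this toy waveform the isolated run of length 1 occurs only once and is discarded, which mirrors how the frequency threshold is meant to suppress runs caused by noise or ruled lines.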

The frequency distribution analysis unit 23 may also obtain the lengths of the continuous white regions in the binary data series and include in the feature amount of the document data, for example, those lengths whose appearance frequency is higher than a predetermined threshold, together with their appearance frequencies. In other words, values indicating the characteristics of all continuous black regions, all continuous white regions, or both in the binarization result are obtained, their frequency distribution is analyzed, and the analysis result is used as the feature amount of the document data.

Alternatively, a representative value of the obtained values, such as the mean, the median, or a quartile, may be calculated and used as a part of the feature amount of the document data. Furthermore, as shown in FIG. 5, the projection waveform data itself or its binarization result may be divided into segments, a representative value of this kind may be obtained from each segment, and these representative values may be used as a part of the feature amount of the document data. The way of obtaining the feature amount is not limited to the methods described above.
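A rough Python sketch of these representative values is given below. The helper names, the choice of the median for the per-segment value, and the equal-width segmentation are assumptions made for illustration; the text does not specify how the waveform is divided.

```python
from itertools import groupby
from statistics import mean, median, quantiles
from typing import Dict, List, Optional, Sequence


def black_runs(bits: Sequence[int]) -> List[int]:
    """Lengths of every maximal run of 1s (black regions)."""
    return [len(list(group)) for key, group in groupby(bits) if key == 1]


def representative_values(lengths: Sequence[int]) -> Dict[str, float]:
    """Mean, median, and first/third quartiles of the run lengths."""
    q1, _, q3 = quantiles(lengths, n=4)  # needs at least two values
    return {"mean": mean(lengths), "median": median(lengths), "q1": q1, "q3": q3}


def per_segment_medians(
    bits: Sequence[int], segments: int
) -> List[Optional[float]]:
    """Split the binary series into equal-width segments (assumed scheme)
    and return the median black-run length of each; None if no black run.
    A run that crosses a segment boundary is counted in both parts."""
    size = -(-len(bits) // segments)  # ceiling division
    result: List[Optional[float]] = []
    for start in range(0, len(bits), size):
        runs = black_runs(bits[start:start + size])
        result.append(median(runs) if runs else None)
    return result


bits = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
summary = representative_values([2, 2, 1, 2, 3])
halves = per_segment_medians(bits, segments=2)
print(summary["mean"], summary["median"])
print(halves)  # [2.0, 1.5]
```

Per-segment values let two pages with the same overall character size but different layouts (for example, a large title block at the top) produce different feature vectors.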

Returning to FIG. 1, the document storage unit 14 stores the input document data in association with the n kinds of feature amounts extracted by the feature amount extraction unit 12. It is realized by a mass storage device such as a hard disk drive or a DVD (Digital Versatile Disc)-RAM/±RW/±R drive.

When a plurality of feature amounts, each extracted by the feature amount extraction unit 12 and stored in the document storage unit 14 in association with document data, have been accumulated, the similarity evaluation unit 13 compares them with one another to obtain the similarity between them. For example, if the feature amount is expressed as a vector (hereinafter referred to as a "feature vector"), the similarity is a degree evaluated from the Euclidean distance between the feature vectors associated with the respective document data.

Specifically, as shown in FIG. 6(A), when the Euclidean distance between feature vectors is smaller than a predetermined threshold, the similarity is evaluated as high, and as shown in FIG. 6(B), when the Euclidean distance is equal to or greater than that threshold, the similarity is evaluated as low. The definition of the distance between feature vectors is not limited to the Euclidean distance.
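The thresholded comparison can be illustrated with a few lines of Python. The function name `similarity`, the labels "high" and "low", and the example vectors and threshold are assumptions for this sketch; only the rule (distance below the threshold means high similarity, otherwise low) comes from the text.

```python
import math
from typing import Sequence


def similarity(a: Sequence[float], b: Sequence[float], threshold: float) -> str:
    """Return "high" if the Euclidean distance is below the threshold,
    and "low" if it is equal to or greater than the threshold."""
    return "high" if math.dist(a, b) < threshold else "low"


# Hypothetical feature vectors: (char height, count, char width, count).
page_a = [12, 3, 30, 5]
page_b = [12, 4, 30, 5]
page_c = [24, 3, 18, 2]

print(similarity(page_a, page_b, threshold=5.0))  # high
print(similarity(page_a, page_c, threshold=5.0))  # low
```

Because the text allows other distance definitions, `math.dist` could be swapped for, say, a Manhattan distance without changing the surrounding logic.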

The document delimiter unit 15 has the similarity evaluation unit 13 obtain the similarity between the items of a series of document data and inserts breaks into the series based on the obtained similarity. Methods for storing the data in document units (each consisting of one or more pages of document data) include, for example, creating a file folder for each document and storing the corresponding document data under file names numbered sequentially in input order, or using an image file format that can hold multiple pages, such as multi-page TIFF (Tagged Image File Format).
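One way the break insertion could work is sketched below. The text does not say which pages are compared with which, so this sketch assumes that each page is compared only with the page immediately before it; the function name `split_into_documents` and the sample values are likewise hypothetical.

```python
import math
from typing import List, Sequence


def split_into_documents(
    features: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Group page indices into documents.

    Assumption: a break is inserted wherever the distance between the
    feature vectors of two consecutive pages reaches the threshold.
    """
    documents: List[List[int]] = []
    for index, vector in enumerate(features):
        if index > 0 and math.dist(features[index - 1], vector) < threshold:
            documents[-1].append(index)   # similar to previous page: same document
        else:
            documents.append([index])     # first page or dissimilar: new document
    return documents


# Hypothetical (character height, line gap) features for five scanned pages.
pages = [[12, 3], [12, 4], [24, 8], [25, 8], [12, 3]]
print(split_into_documents(pages, threshold=3.0))  # [[0, 1], [2, 3], [4]]
```

Each inner list could then be written out as one folder of sequentially numbered files or as one multi-page TIFF, as the paragraph above describes.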

The document output unit 16 outputs, in a predetermined format, the document data whose output has been instructed. It is realized by, for example, a CRT (Cathode Ray Tube) and its control means, a printer device and its control means, a read/write device for a magnetic disk, memory card, or the like and its control means, or a data transfer device that exchanges data via a network or the like. That is, the document output unit 16 outputs, for example, a document printed on paper, image data displayed on a CRT, or a file formatted in HTML (Hyper Text Markup Language) or the like.

Next, the procedure for separating document data in the document filing device of this embodiment configured as above will be described with reference to the flowchart of FIG. 7.

The user sets, for example, paper originals consisting of one or more pages on the ADF. The paper originals may be a single document (of one or more pages) or a plurality of documents (each of one or more pages), and the user need not be aware of the document breaks when setting them. The ADF feeds the originals to the scanner device one page at a time. The scanner device then functions as the document input unit 11 in FIG. 1. That is, the document input unit 11 inputs to the document filing device as many items of document data as there are pages of paper originals set on the ADF.

When document data is input from the document input unit 11 (step S11), it is determined whether the feature amount of the input document data has already been extracted and stored in the document storage unit 14 in association with that data (step S12). If the feature amount has not yet been extracted, the feature amount extraction unit 12 extracts it from the input document data (step S13), and the process then proceeds to step S14. If the feature amount has already been extracted, the process proceeds directly to step S14. That is, feature amounts are not extracted again for document data whose feature amounts have already been extracted and stored in the document storage unit 14.

The above processing is performed for all of the input document data (step S14). Next, the similarity evaluation unit 13 evaluates the feature amounts associated with the document data stored in the document storage unit 14 against the feature amounts associated with the input document data (step S15). Based on the evaluation result, the breaks between documents are fixed, and the document data is separated into document units and stored in the document storage unit 14 (step S16).
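The flow of steps S11 to S16 can be outlined in Python as follows. This is a sketch under several assumptions that are not in the text: pages carry an identifier, the document storage unit is modeled as a dictionary, the feature extractor is passed in as a function, and consecutive pages are compared by Euclidean distance. All names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def file_pages(
    pages: Sequence[Tuple[str, object]],
    store: Dict[str, Vector],
    extract: Callable[[object], Vector],
    threshold: float,
) -> List[List[str]]:
    """Return page identifiers grouped into documents."""
    # S11 to S14: extract features only for pages not yet in the store.
    for page_id, image in pages:
        if page_id not in store:              # S12: already extracted?
            store[page_id] = extract(image)   # S13: extract and keep
    # S15 and S16: evaluate similarity and fix the breaks.
    documents: List[List[str]] = []
    previous = None
    for page_id, _ in pages:
        vector = store[page_id]
        if previous is not None and math.dist(previous, vector) < threshold:
            documents[-1].append(page_id)
        else:
            documents.append([page_id])
        previous = vector
    return documents


calls = []

def fake_extract(image):
    """Stand-in extractor: the 'image' is already a feature vector."""
    calls.append(image)
    return list(image)

store = {"p1": [10, 2]}                       # p1 was processed earlier
scanned = [("p1", [10, 2]), ("p2", [10, 3]), ("p3", [40, 9])]
result = file_pages(scanned, store, fake_extract, threshold=5.0)
print(result)      # [['p1', 'p2'], ['p3']]
print(len(calls))  # 2
```

The extractor runs only for `p2` and `p3`, which corresponds to the statement that feature amounts already held in the document storage unit 14 are not extracted again.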

In the embodiment described above, when the feature amount extraction unit 12 extracts feature amounts from the binary data series, it obtains, for example, the lengths of all continuous black regions in the series and takes the n/2 lengths whose appearance frequency is higher than a predetermined threshold, together with their appearance frequencies, as the feature amount of the document data. The extraction is not limited to this, however, and the following methods may also be adopted.

When feature amounts are extracted from the binary data series, for example, the skewness of the values obtained by measuring the lengths of all continuous black regions in the series may be calculated and used as a part of the feature amount of the document data. If the document data is something like printed plain text, the distribution is expected to be left-right symmetric, as shown in FIG. 8, so the calculated skewness will be close to 0.

Similarly, when feature amounts are extracted from the binary data series, the kurtosis of the values obtained by measuring the lengths of all continuous black regions in the series may be calculated and used as a part of the feature amount of the document data. If the document data is something like printed plain text, the values are expected to vary within a narrow range, as shown in FIG. 9, so the calculated kurtosis will be large.
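For reference, skewness and kurtosis of a list of run lengths can be computed as below. The text does not state which definitions are intended, so this sketch assumes the population (moment-based) forms, with kurtosis reported without subtracting 3; the function names are hypothetical.

```python
from typing import Sequence


def central_moment(values: Sequence[float], order: int) -> float:
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values: Sequence[float]) -> float:
    """Third standardized moment; 0.0 for a symmetric distribution."""
    m2 = central_moment(values, 2)
    return 0.0 if m2 == 0 else central_moment(values, 3) / m2 ** 1.5


def kurtosis(values: Sequence[float]) -> float:
    """Fourth standardized moment (assumed non-excess form)."""
    m2 = central_moment(values, 2)
    return 0.0 if m2 == 0 else central_moment(values, 4) / m2 ** 2


symmetric_runs = [1, 2, 2, 2, 3]   # e.g. evenly sized text lines
skewed_runs = [1, 1, 1, 5]         # e.g. text lines plus one tall image

print(skewness(symmetric_runs))    # 0.0
print(skewness(skewed_runs) > 0)   # True
print(kurtosis(symmetric_runs))
```

Returning 0.0 when the variance is zero is a design choice of this sketch to avoid division by zero when every run has the same length.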

FIG. 1 is a block diagram showing a configuration example of a document filing device according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram of feature amount extraction in the feature amount extraction unit.
FIG. 3 is a block diagram showing an example of a specific configuration of the feature amount extraction unit.
FIG. 4 is an explanatory diagram of a specific example of extracting feature amounts from a binary data series.
FIG. 5 is an explanatory diagram of another specific example of extracting feature amounts from a binary data series.
FIG. 6 is a conceptual diagram of similarity evaluation.
FIG. 7 is a flowchart showing the procedure for separating document data.
FIG. 8 is an explanatory diagram of another example of extracting feature amounts from a binary data series.
FIG. 9 is an explanatory diagram of yet another example of extracting feature amounts from a binary data series.
FIG. 10 is a diagram (part 1) used to explain the problems of the prior art.
FIG. 11 is a diagram (part 2) used to explain the problems of the prior art.

Explanation of symbols

11: document input unit, 12: feature amount extraction unit, 13: similarity evaluation unit, 14: document storage unit, 15: document delimiter unit, 16: document output unit, 21: projection waveform generation unit, 22: binarization unit, 23: frequency distribution analysis unit

Claims

A feature amount extraction device comprising:
projection waveform generation means for generating projection waveform data from input multi-value document data;
binarization means for binarizing the projection waveform data generated by the projection waveform generation means; and
analysis means for obtaining, for every occurrence, a value indicating a characteristic of at least one of the continuous black regions and the continuous white regions in the binarization result produced by the binarization means, analyzing the frequency distribution of the obtained values, and using the analysis result as a feature amount of the document data.

The feature amount extraction device according to claim 1, wherein the analysis means obtains the lengths of the continuous black regions in the binarization result and includes in the feature amount those obtained values whose appearance frequency is higher than a predetermined threshold.

The feature amount extraction device according to claim 1, wherein the analysis means obtains the lengths of the continuous white regions in the binarization result and includes in the feature amount those obtained values whose appearance frequency is higher than a predetermined threshold.

The feature amount extraction device according to claim 1, wherein the analysis means uses a representative value of the obtained values as part of the feature amount.

The feature amount extraction device according to claim 4, wherein the representative value is the mean, the median, or a quartile of the obtained values.

The feature amount extraction device according to claim 4, wherein the analysis means divides the projection waveform data or the binarization result into parts and obtains the representative value from each of the divided parts.

The feature amount extraction device according to claim 1, wherein the analysis means obtains a skewness from the values obtained by determining the lengths of all the continuous black regions, and uses the skewness as part of the feature amount.

The feature amount extraction device according to claim 1, wherein the analysis means obtains a kurtosis from the values obtained by determining the lengths of all the continuous black regions, and uses the kurtosis as part of the feature amount.

A feature amount extraction method comprising:
a first step of generating projection waveform data from input multi-value document data;
a second step of binarizing the projection waveform data generated in the first step; and
a third step of obtaining, for every occurrence, a value indicating a characteristic of at least one of the continuous black regions and the continuous white regions in the binarization result of the second step, analyzing the frequency distribution of the obtained values, and using the analysis result as a feature amount of the document data.

The feature amount extraction method according to claim 9, wherein, in the third step, the lengths of the continuous black regions in the binarization result are obtained, and those obtained values whose appearance frequency is higher than a predetermined threshold are included in the feature amount.

A document filing device comprising:
projection waveform generation means for generating projection waveform data from input multi-value document data;
binarization means for binarizing the projection waveform data generated by the projection waveform generation means;
analysis means for obtaining, for every occurrence, a value indicating a characteristic of at least one of the continuous black regions and the continuous white regions in the binarization result produced by the binarization means, analyzing the frequency distribution of the obtained values, and using the analysis result as a feature amount of the document data; and
determination means for determining a break between document units based on the feature amount obtained by the analysis means.

The document filing device according to claim 11, wherein the analysis means obtains the lengths of the continuous black regions in the binarization result and includes in the feature amount those obtained values whose appearance frequency is higher than a predetermined threshold.
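For readers who want a concrete picture of the processing recited in the claims, the following Python sketch walks through projection, binarization, and frequency analysis of run lengths on a toy image. It is not the patented implementation. The assumptions are: pixel values are larger for darker pixels, the binarization threshold is supplied by the caller (the claims do not say how it is chosen), and all function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import groupby


def project_rows(image):
    """Accumulate pixel values along each row (horizontal projection)."""
    return [sum(row) for row in image]


def project_columns(image):
    """Accumulate pixel values along each column (vertical projection)."""
    return [sum(column) for column in zip(*image)]


def binarize(waveform, threshold):
    """Map each projection value to 1 (black) or 0 (white)."""
    return [1 if value > threshold else 0 for value in waveform]


def run_lengths(bits, target):
    """Lengths of every run of consecutive `target` values (1 or 0)."""
    return [sum(1 for _ in group) for value, group in groupby(bits) if value == target]


def frequent_lengths(lengths, min_count):
    """Lengths whose appearance frequency is higher than min_count."""
    counts = Counter(lengths)
    return sorted(length for length, count in counts.items() if count > min_count)


# Toy multi-value image: two "text lines" two rows tall, one block three rows tall.
image = [
    [0, 0, 0, 0],
    [9, 9, 0, 9],
    [9, 0, 9, 9],
    [0, 0, 0, 0],
    [9, 9, 9, 0],
    [9, 9, 0, 9],
    [0, 0, 0, 0],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
    [0, 0, 0, 0],
]

waveform = project_rows(image)
bits = binarize(waveform, threshold=10)
black = run_lengths(bits, target=1)
white = run_lengths(bits, target=0)
feature = frequent_lengths(black, min_count=1)
```

In this toy case the black run length that survives the frequency threshold corresponds to the height of the two text lines, while the single taller block is filtered out.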