JP4442208B2 - Character string notation analysis method and apparatus

Info

Publication number: JP4442208B2
Application number: JP2003408413A
Authority: JP
Inventors: 健永崎; 勝美丸川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-12-08
Filing date: 2003-12-08
Publication date: 2010-03-31
Anticipated expiration: 2023-12-08
Also published as: JP2005173669A

Description

The present invention relates to a character string recognition method that uses a grammar, and to a recording medium on which a character string notation analysis program is recorded.

Character string notation analysis is widely used to compensate for the uncertainty of character segmentation and character recognition in character string recognition, and to convert a character string image into character string text. Common algorithms include those based on morphological analysis, RTN matching (recursive transition network matching), and bottom-up parsing.

For example, Japanese Patent Laid-Open No. 05-108891 (Patent Document 1) describes applying morphological analysis to OCR recognition results as a means of improving OCR reading accuracy. Knowledge processing such as morphological analysis can correct misreadings, but the dictionaries used in ordinary morphological analysis target general text such as newspapers. To proofread documents for specialized business use accurately, special dictionaries suited to the field must be defined in addition, which leaves problems of maintainability and computational cost. Furthermore, because morphological analysis covers a very broad range of notation knowledge, analyzing that knowledge takes time and the notation analysis requires an enormous amount of memory.

Japanese Patent Laid-Open No. 2002-117374 (Patent Document 2) proposes character string notation analysis using bottom-up parsing for handwritten digit strings. Bottom-up parsing is generally considered to require less computation than top-down parsing, and it is often applied to notations that can be expressed by simple rules, such as digit strings. However, because the algorithm does not guarantee optimality, its robustness against problems that can occur in character string recognition, such as misread or unread characters and noise, is limited.

On the other hand, matching algorithms based on dynamic programming (DP) are widely used in speech recognition and character recognition as robust algorithms that guarantee an optimal solution. This suggests the idea of combining the advantages of dynamic programming with those of a notation grammar.

For example, Japanese Patent Laid-Open No. 7-219578 (Patent Document 3) proposes a method that uses dynamic programming for speech recognition. In that method, however, dynamic programming is used only to compute the optimal cost of word matching in an HMM model, and grammar interpretation and selection of the optimal vocabulary are performed in a separate stage. It therefore does not extend dynamic programming to interpret a grammar.

Patent Document 1: Japanese Patent Laid-Open No. 05-108891
Patent Document 2: Japanese Patent Laid-Open No. 2002-117374
Patent Document 3: Japanese Patent Laid-Open No. 7-219578
Patent Document 4: Japanese Patent Laid-Open No. 09-319824
Patent Document 5: Japanese Patent Laid-Open No. 2000-251012
Patent Document 6: Japanese Patent Laid-Open No. 2001-014311

The present invention concerns optical character string recognition, and in particular a robust character string recognition method that analyzes the input on the basis of character string notation knowledge and recognizes the character string even when the image is degraded by missing characters, touching characters, noise, or misread and unread characters.

Conventional methods such as RTN matching trace the notation grammar by breadth-first search and keep the interpretation result as a parse tree. Because of constraints such as the size of the search space, notation analysis is easily derailed by misread or unread characters and by incomplete notation knowledge, and the optimality of the solution is not guaranteed. The first object of the present invention is to propose a method that avoids the adverse effect on character string recognition of errors in character identification, character segmentation, and character line extraction that can occur in OCR reading.

In addition, conventional notation knowledge analysis by dynamic programming builds the entire cost network in advance and then searches it for the optimal cost path, so building the cost network and computing the costs require a large amount of computation and memory. The second object of the present invention is to propose a method of character string notation analysis by optimal cost path search that keeps computation and memory requirements low.

To achieve the first object, the present invention adopts dynamic programming (DP), which guarantees the optimality of the solution, as the algorithm for character string notation analysis, and thereby copes with the sources of instability that can arise in character string recognition (misread, unread, missing, and touching characters, and inserted noise) and with incomplete notation knowledge.

To achieve the second object, the present invention makes the notation knowledge compact by means of quasi-regular expressions, analyzes the notation knowledge sequentially using a stack, and computes the optimal cost incrementally, thereby reducing computation and memory requirements.

According to the present invention, the DP-based notation knowledge analysis algorithm enables robust notation analysis that guarantees the optimality of the solution against the causes of character string recognition failure arising from degraded document images (missing characters, touching characters, noise, and the resulting misread or unread characters). Furthermore, allowing a quasi-regular grammar as notation knowledge makes it possible to write the notation knowledge compactly as a trie-type grammar and to reduce the amount of computation.

First, an overview of the processing flow of character string recognition is given with reference to FIG. 1. In the character string recognition apparatus according to an embodiment of the present invention, an OCR device images a paper document and converts it into electronic image data (0101). This step can be omitted when the original document is already electronic image data. Next, document structure analysis, such as ruled line extraction, frame structure analysis, and position estimation of the frames to be read, is performed on the electronic image data (0102). Known techniques (for example, Japanese Patent Laid-Open No. 09-319824 (Patent Document 4) and Japanese Patent Laid-Open No. 2000-251012 (Patent Document 5)) are used for this recognition processing. Next, based on the result of the document structure analysis, the character lines to be read are extracted (0103). Character pattern candidates are then segmented from the character line image, and each character pattern is identified (0104). The segmented character patterns together with their identification results are called a character string hypothesis.

When the character notations that can be written in the document to be read are known in advance, notation analysis is applied to the character string hypothesis (0105). The character string hypothesis, which contains ambiguity in character segmentation and character identification, is thereby converted into character string text and output from the OCR as the reading result text (0106). However, when the reliability of the reading result text is low, for example because the analysis using the notation knowledge could not be carried out adequately, the character string hypothesis itself is output. Both the reading result text and the reading hypothesis data retain, where necessary, the position on the document image at which the character string is written. The above processing outputs the reading result text and the reading hypothesis data, and document processing is generally performed on the basis of these data.

The character string notation analysis process and the character string hypothesis are outlined in FIGS. 2 and 3. FIG. 2 illustrates the flow of character string recognition using a character string hypothesis and notation knowledge, and FIG. 3 shows the concept of the character string hypothesis and the details of its data. Referring to FIG. 2, portions of the character line to be read (a) that are presumed to be character patterns are segmented in various ways to create character pattern candidates, and the result of identifying each candidate is the character string hypothesis (b). A character string hypothesis holds at least the character pattern candidates, the ranked group of identified character codes obtained by character identification, and information on the connections between character pattern candidates within the hypothesis. This representation of a character string hypothesis is called a graph representation.

Next, the character string path (d) is computed from the character string hypothesis using the character string notation knowledge (c). A character string path is a uniquely determined character code string (text) together with the sequence of character patterns corresponding to each character code. In this example the notation knowledge is expressed as words joined by the OR symbol (|); that is, the words separated by the symbol | are the ones specified as notation knowledge. Notation knowledge can also be expressed in other ways, for example with a trie or a context-free grammar (described in Japanese Patent Laid-Open No. 2001-014311 (Patent Document 6) and elsewhere). The character string hypothesis is shown in detail in FIG. 3. It is represented as a directed graph whose arcs (0301) are character pattern candidates and whose nodes (0302) are character pattern boundaries. Each character pattern carries the boundary ID numbers of its left and right nodes (upper and lower nodes for vertical writing), its character identification candidates (0303), and their identification similarities (0304). Knowledge processing takes the character string hypothesis and the character string notation knowledge as input and finds the words, and their pattern sequences, that the hypothesis can contain.

For example, the word 血液化学検査 (blood chemistry test) in the notation knowledge can be found in the character string hypothesis of FIG. 3(b) by following the circled character codes and character patterns (0305). When the notation of the character strings written in the field is determined in advance, this processing determines the character string code.
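The graph representation of a character string hypothesis can be illustrated with a small sketch. The `Arc` class, the `find_word` function, and the sample lattice below are hypothetical and are not taken from the patent or its figures; the sketch only assumes what the text states, namely that arcs are character pattern candidates joined at numbered boundary nodes and that each arc carries ranked candidate characters with similarities. It performs a plain exact lookup of one word, not the DP matching described later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    """One character pattern candidate between two boundary nodes (hypothetical layout)."""
    left: int          # boundary ID on the left (upper for vertical writing)
    right: int         # boundary ID on the right (lower for vertical writing)
    candidates: tuple  # ((character, similarity), ...) in ranked order


def find_word(arcs, word, start, end):
    """Return the list of (left, right) boundaries spelling `word`, or None."""
    def search(node, i):
        if i == len(word):
            return [] if node == end else None
        for arc in arcs:
            # Follow an arc only if it starts here and lists the wanted character.
            if arc.left == node and any(ch == word[i] for ch, _ in arc.candidates):
                rest = search(arc.right, i + 1)
                if rest is not None:
                    return [(arc.left, arc.right)] + rest
        return None
    return search(start, 0)


# Invented lattice: the first character may be one pattern (0-2) or two (0-1, 1-2).
arcs = [
    Arc(0, 1, (("イ", 0.80),)),
    Arc(1, 2, (("ヒ", 0.70), ("七", 0.60))),
    Arc(0, 2, (("化", 0.90), ("北", 0.50))),
    Arc(2, 3, (("学", 0.85), ("字", 0.80))),
]

print(find_word(arcs, "化学", 0, 3))
```

The lookup succeeds only when every character of the word appears among the candidates of some connected arc, which is why a more tolerant matching method is needed when characters are misread or missing.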

The above describes character string recognition and the place of character string notation analysis within it. Within this processing, the objects of the present invention concern the following two points.
1) To achieve character string notation analysis that is robust against degradation of the image to be read (missing characters, touching characters, noise, and the misread or unread characters that can result), matching is performed by dynamic programming (DP), which guarantees the optimality of the solution.
2) The notation knowledge dictionary used by DP matching accepts dictionary entries written as quasi-regular expressions, so that notation variants of a keyword can be defined easily. Allowing quasi-regular expressions also makes a compact description of the notation dictionary possible.

String matching by dynamic programming (hereinafter DP) has the drawback of being somewhat slow, but it has advantages that conventional notation analysis lacks: it can generally match while tolerating N unread, inserted, or missing characters, and it finds the character string that is optimal in terms of edit distance. Notation analysis usually describes the strings to be matched in an RTN grammar (Recursive Transition Network grammar), but no engine performs DP matching directly on the structure of an RTN grammar, because an RTN grammar allows recursive descriptions. DP matching treats the problem as an optimal control problem over a time series (a one-dimensional sequence) and solves a functional recurrence whose arguments are the choices arranged along that sequence. Applied to string matching, this becomes the problem of selecting, along a one-dimensional grammar sequence, the grammar elements that best explain the input character patterns. Because an RTN grammar can define recursive structures, its grammar sequence is not one-dimensional, and so DP cannot be applied in principle.

The grammar structure used for DP is therefore not an RTN grammar but a simple regular expression. Even so, several levels of regular expression notation must be distinguished in relation to the DP matching algorithm, as described next.

A regular expression defines a notation using ordinary characters and special characters such as alternation |, grouping ( ), and optional [ ] (the wildcards * and ? are not considered here). There are also some special symbols that the DP matching engine is expected to use with a view to address dictionaries and personal name dictionaries. The keywords to be extracted are written as quasi-regular expressions that include these special symbols, and the forms of description are classified into the following three types. This classification is related to the DP matching algorithm described in the next section.

1) Word enumeration notation
An example of word enumeration notation follows. It defines the four words 精神療法 (psychotherapy), 精神分析 (psychoanalysis), 精神分析療法 (psychoanalytic therapy), and 精神分析学 (psychoanalytics).
S(精神療法|精神分析|精神分析療法|精神分析学)E
In this example the word definition sequence begins with S and ends with E, and the four words are written side by side inside the parentheses.

2) Trie notation
Rewriting the definition of the four words above as a trie gives the following. Trie notation has the advantage of making the dictionary definition compact.
S(精神(療法|分析[療法|学]))E

3) Optional-element and branching notation
Notation that allows optional elements and branching in mid-word can be used to absorb, in the grammar definition, notation variants within a word such as the following.
S(コンピュータ[ー](管理|診断)料)E
This captures the variation between the spellings コンピューター and コンピュータ (both "computer"). It also indicates that the word continues as 管理料 (management fee) or 診断料 (diagnosis fee).
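To make the three notations concrete, the following sketch expands a quasi-regular expression into the set of words it defines. It is only an illustration: the function name `expand` is hypothetical, the special symbols are assumed to be the half-width characters S, E, (, ), [, ], and |, and every other character is treated as an ordinary character.

```python
def expand(grammar):
    """Expand a quasi-regular expression of the form S...E into its word set."""
    if not (grammar.startswith("S") and grammar.endswith("E")):
        raise ValueError("grammar must start with S and end with E")
    body = grammar[1:-1]
    pos = 0

    def expect(ch):
        nonlocal pos
        if pos >= len(body) or body[pos] != ch:
            raise ValueError(f"expected {ch!r} at position {pos}")
        pos += 1

    def parse_sequence():
        # A sequence is a run of characters and groups, ended by |, ) or ].
        nonlocal pos
        words = {""}
        while pos < len(body) and body[pos] not in "|)]":
            ch = body[pos]
            pos += 1
            if ch == "(":            # grouping: exactly one alternative is chosen
                inner = parse_alternatives()
                expect(")")
            elif ch == "[":          # optional: one alternative or nothing
                inner = parse_alternatives() | {""}
                expect("]")
            else:                    # ordinary character
                inner = {ch}
            words = {w + s for w in words for s in inner}
        return words

    def parse_alternatives():
        nonlocal pos
        words = parse_sequence()
        while pos < len(body) and body[pos] == "|":
            pos += 1
            words |= parse_sequence()
        return words

    result = parse_alternatives()
    if pos != len(body):
        raise ValueError(f"unexpected {body[pos]!r} at position {pos}")
    return result


print(sorted(expand("S(精神(療法|分析[療法|学]))E")))
```

Under these assumptions the enumeration form and the trie form of the four-word example expand to the same set, and the third example expands to four words covering both spellings of "computer".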

The DP matching algorithm is formulated as follows. Let N denote the network to be matched (the candidate character network) and C the matching grammar. The network N consists of a start node n_s, an end node n_e, and a set of edges n, where each edge corresponds to a character pattern on the candidate character network. The matching grammar C likewise has a graph structure, consisting of a start node c_s, an end node c_e, and a set of edges c, where each edge corresponds to a grammar element of C.

DP matching performs matching between these two graphs. Consider matching a grammar element c_j against a character pattern n_q, and associate a matching cost with this pair. The grammar element preceding c_j is written c_i; there may be more than one, and their set is written PreE(C, c_j), where PreE(G, e) denotes, for a graph G, the set of edges located immediately before the edge e of interest. Similarly, a character pattern preceding n_q is written n_p, and the set of such patterns is written PreE(N, n_q); again there may be more than one.

The DP matching cost is then expressed by a recurrence formula whose terms are as follows. Match() is a cost indicating whether the grammar element c is contained in the character candidates of the character pattern n. Path() is a cost indicating whether the character patterns n_p and n_q are connected, and Next() is a cost indicating whether the grammar elements c_i and c_j are connected. Evaluating the recurrence for every grammar element c_j ∈ C yields the optimal grammar sequence. This is DP matching.
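The recurrence itself is not legible in this text. One form consistent with the definitions of Match(), Path(), Next(), and the predecessor sets is sketched below; it is a reconstruction under assumptions, not the patent's own formula. The cost symbol D, the additive combination of the three terms, and the restriction of the minimum to the predecessor sets are all assumed, and the original may contain further terms (for example, for skipped or inserted characters).

```latex
D(c_j, n_q) \;=\;
  \mathrm{Match}(c_j, n_q)
  \;+\;
  \min_{\substack{c_i \in \mathrm{PreE}(C,\, c_j) \\ n_p \in \mathrm{PreE}(N,\, n_q)}}
  \Bigl[\, D(c_i, n_p) + \mathrm{Next}(c_i, c_j) + \mathrm{Path}(n_p, n_q) \,\Bigr]
```

Read this way, the cost of explaining n_q by c_j is the local match cost plus the cheapest way of having explained some preceding pattern n_p by some preceding grammar element c_i.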

Note that, for DP to find the optimal solution, the structures of the graphs C and N must be partially ordered sets. That is, the candidate character network N and the grammar C are directed graphs and must not contain cycles. In addition, as mentioned earlier, the following three levels of structure must be distinguished for the grammar C.
Grammar level 1, enumeration type: S(word|word|...|word)E
Grammar level 2, trie type: S(word((word|word)|word|...))E
Grammar level 3, branching type: S((word(word|word)word)...)E
Simple DP matching corresponds to the word enumeration type of grammar level 1. To reduce dictionary size and matching computation by using trie notation or branching notation, grammar level 2 or higher must be supported. An RTN grammar is considered to fall under grammar level 3 as long as nonterminal symbols are not used recursively. From the standpoint of DP matching, the difference between grammar level 2 and grammar level 3 is whether the set PreE(C, c_j) of grammar elements preceding a grammar element c_j contains one element or several. At grammar level 2 each element has a single predecessor, so the matching algorithm can be implemented easily with a stack-type DP calculation table, as described below. At grammar level 3 an element can have several predecessors, so an algorithm using multiple stacks is required. The matching algorithm for grammar level 2 is described here.

The DP matching formulated above can also be viewed as an optimal cost path problem on a network. Two ways of computing the optimal cost are conceivable: building the cost network in advance, and using a stack of tables. Their advantages and disadvantages are compared below using FIG. 4 as an example.

FIG. 4(a) is a conceptual diagram of matching two character strings viewed as a minimum cost path problem based on edit distance. The correspondence between the two strings is represented by the edges of a lattice. Vertical and horizontal edges cost 1; a diagonal edge costs 0 if the characters above and to the left match and 1 if they do not. The minimum cost from the upper-left point to the lower-right point is then the edit distance, and the minimum cost path gives the correspondence between the strings. In this example, three characters of the correct string are matched and one character has no counterpart.
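The lattice formulation corresponds to the standard edit-distance computation, which can be sketched as follows. The function name and the sample strings are illustrative and do not come from FIG. 4; the edge costs are the ones stated in the text (1 for vertical and horizontal edges, 0 or 1 for diagonal edges).

```python
def edit_distance(a, b):
    """Minimum cost path from the upper-left to the lower-right lattice point."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # only vertical edges, cost 1 each
    for j in range(cols):
        d[0][j] = j                      # only horizontal edges, cost 1 each
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            d[i][j] = min(diagonal, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1]


# A four-character word with one misread character: three match, one does not.
print(edit_distance("精神分析", "精神分折"))
```

This version keeps the whole lattice in memory, which is convenient for recovering the minimum cost path but is more than the cost value alone requires.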

1) Cost network pre-construction type (FIG. 4(b))
In this method a DP cost network is built in advance between the group of words to be matched (the word set) and the character candidate network, and the shortest path on the cost network is found. When the cost network for all words is built in advance, the memory required for the DP calculation is proportional to the candidate network size × the total grammar length. This uses more memory, but it is sufficient for simple word matching. In this DP calculation, building the cost network takes the most time.

2) Stack table type (FIG. 4(c))
In this method the description string of the grammar (the word group to be matched, written in a structure such as a trie) is scanned from the left, and DP calculation tables are kept for two consecutive characters of a word string. In a trie structure the grammar contains structural descriptors (characters such as ( ) and |), so two consecutive characters of a word are not necessarily adjacent in the grammar. For this reason, when the grammar is scanned from right to left, the DP calculation table must be saved on a stack at the appropriate grammatical boundaries. The memory required for the DP calculation is at minimum the candidate network size × 2, and when path sequences are stored it is proportional to the candidate network size × the maximum word length.

The cost network pre-construction type has a simple algorithm but requires memory proportional to |N| × |C|, the product of the numbers of edges in the candidate network N and the grammar C. To make the keyword extraction engine compact (in dictionary size and working memory), this patent adopts the stack table type. With the stack table type this memory can be reduced to as little as |N| × 2, or to |N| × the maximum word length even when word paths are stored.

The DP matching algorithm using the stack table is described next. The grammar that the algorithm analyzes consists of the special symbols S, E, (, ), [, ], and |, and ordinary grammar elements (character codes, denoted C). In the DP matching algorithm for trie-type grammars, this grammar description string is scanned from right to left, and the calculation proceeds by pushing DP tables onto the stack and popping them at the appropriate times.

The DP table holds the optimal cost (and path) under the assumption that the grammar element c_j currently being scanned corresponds to each character pattern n_q ∈ N on the candidate network. Writing the DP table for the grammar element c_j as Dpt(c_j), it collects these costs over all n_q ∈ N. The optimal cost is computed by calculating the DP table Dpt(c_j) for each grammar element c_j in turn, starting from the leading grammar element S. Computing the DP table for the grammar element c_j of interest requires the preceding DP table Dpt(c_i) for the preceding grammar element c_i, which is obtained by referring to the stack. The timing of these stack operations is shown in FIG. 5. As a concrete example, for the grammar S(12|3[4|5]|67)E, which writes the word set {12, 3, 34, 35, 67} as a trie, FIG. 6 shows the processing and FIG. 7 shows how the contents of the stack change. The DP cost formulas (Formulas 7 and 8) and the stack operation timing (FIG. 5) together describe the grammar-driven DP character string notation processing.
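The stack-based computation for a trie-type (grammar level 2) grammar can be sketched as follows. This is a simplified illustration under several assumptions and is not the procedure of FIGS. 5 to 7: the input is a plain character string rather than a candidate character network, unit edit costs stand in for Match(), Path(), and Next(), the grammar is scanned from the leading S toward E, the special symbols are half-width, and the names `dp_row` and `best_match` are hypothetical.

```python
def dp_row(prev, ch, text):
    """Compute the DP table for one more grammar character from its predecessor's table."""
    cur = [prev[0] + 1]                  # grammar character with no input consumed
    for q in range(1, len(text) + 1):
        substitute = prev[q - 1] + (0 if text[q - 1] == ch else 1)
        cur.append(min(substitute, prev[q] + 1, cur[q - 1] + 1))
    return cur


def best_match(grammar, text):
    """Return (word, cost): the grammar word nearest to `text` in edit distance."""
    if not (grammar.startswith("S") and grammar.endswith("E")):
        raise ValueError("grammar must start with S and end with E")
    table, prefix = list(range(len(text) + 1)), ""   # table for the empty prefix
    stack = [(table, prefix)]            # frame belonging to S
    word_open = False                    # prefix ends in a character not yet recorded
    best = (None, float("inf"))

    def record():
        # A word ends here: compare its cost against the best word so far.
        nonlocal best, word_open
        if word_open and table[-1] < best[1]:
            best = (prefix, table[-1])
        word_open = False

    for sym in grammar[1:-1]:
        if sym in "([":
            if sym == "[":
                record()                 # optional group: the prefix alone is a word
            stack.append((table, prefix))    # save the table at the branch point
        elif sym == "|":
            record()
            table, prefix = stack[-1]    # restart from the saved branch point
        elif sym in ")]":
            record()
            stack.pop()
            table = None                 # in a trie, no character follows a closed group
        else:
            if table is None:
                raise ValueError("character after a closed group (level 3) is not supported")
            table = dp_row(table, sym, text)
            prefix += sym
            word_open = True
    record()
    return best


print(best_match("S(12|3[4|5]|67)E", "6x7"))
```

In this sketch each table has one entry per input position, and the stack holds one saved table per open group, so the working memory grows with the nesting depth of the grammar rather than with its total length. A grammar in which a character follows a closed group, such as the level 3 example with 料, is rejected because that character would have several predecessors.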

Because the DP matching engine tolerates N unread, skipped, or omitted characters when matching, it can pick up words that conventional RTN-matching knowledge processing could not. Furthermore, because it accepts a quasi-regular grammar as notation, notation knowledge can be expressed compactly as a trie, and notation variants can be handled. This makes it possible to read precisely those keywords in a document image that are needed for document inspection and to search or check content on the basis of the keywords read, realizing automation of document processing that was difficult with conventional OCR because of problems such as unread characters.

Finally, an example of building a document processing system using the method proposed in this patent is described with reference to FIG. 8. FIG. 8 shows one configuration in which the document processing system is built with the OCR device and the document processing device separated. The upper part of FIG. 8 shows an example configuration of the OCR device, and the lower part an example configuration of the document image processing device.

First, in the OCR device in the upper part, an image input device (0801) converts a document into electronic data, which is stored in an external storage device (0804) and a memory (0805) and read by a central processing unit (0806). Document format definitions and the like are stored in the external storage device (0804), and document structure analysis refers to the definitions stored there. A human operator can control these processes through an operation terminal device (0802). Processing results and the like are displayed through a display terminal device (0803), and are either stored in the external storage device or sent to an external device through a communication device (0807). The result read by the OCR can be output as a text file, as in conventional devices, but it can also be output as OCR reading hypothesis data. The OCR reading hypothesis data is stored in the external storage device or sent to an external device through the communication device. In either case, the OCR reading hypothesis data is assigned a document ID code corresponding to the document (or image) that the OCR read. This document ID code makes it possible to associate the paper document or document image with its OCR reading hypothesis data.
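One possible shape for OCR reading hypothesis data tagged with a document ID code is sketched below. The patent does not specify a storage format, so the JSON encoding and every class and field name here (`OcrHypothesisData`, `Pattern`, `CharCandidate`, `doc_id`, and so on) are assumptions made for illustration only.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CharCandidate:
    char: str          # character identification result
    similarity: float  # character identification similarity


@dataclass
class Pattern:
    start: int  # left pattern boundary (assumed to be a pixel column)
    end: int    # right pattern boundary
    candidates: list = field(default_factory=list)


@dataclass
class OcrHypothesisData:
    doc_id: str  # document ID code linking the data to the document or image
    patterns: list = field(default_factory=list)


def to_json(data):
    """Serialize hypothesis data for storage or transmission."""
    return json.dumps(asdict(data), ensure_ascii=False)


def from_json(text):
    """Rebuild hypothesis data from its JSON form."""
    raw = json.loads(text)
    patterns = [
        Pattern(p["start"], p["end"], [CharCandidate(**c) for c in p["candidates"]])
        for p in raw["patterns"]
    ]
    return OcrHypothesisData(raw["doc_id"], patterns)


data = OcrHypothesisData(
    "DOC-0001",
    [Pattern(0, 12, [CharCandidate("3", 0.9), CharCandidate("8", 0.6)])],
)
print(to_json(data))
```

Keeping several candidates per pattern, rather than only the best one, is what lets a later search step recover from a wrong first choice by the recognizer.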

The document image processing device in the lower part of FIG. 8 performs document search and document browsing using the text file or the OCR reading hypothesis data output from the OCR device described above. Once OCR output data has been generated for a document, that document can be searched and browsed any number of times, as long as the OCR output data exists. The document image processing device reads the OCR additional data from a communication device (0815) and an external storage device (0812), loads it into a memory (0813), and performs search and browsing processing with a central processing unit (0814). The words to be searched for and the document search rules are either stored in the external storage device or entered from an operation terminal device (0810). Word search results are displayed through a display terminal device (0811), and can also be sent to an external device through the communication device or stored in the external storage device. These devices are connected by communication buses (0807, 0808, 0809, 0815, 0816). The grammar-driven DP type notation analysis processing described in this patent may be incorporated in either the upper part or the lower part of FIG. 8.
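The repeated search over stored OCR output can be pictured with the small sketch below. It assumes, purely for illustration, that each document's hypothesis data has been reduced to a list of candidate character sets, one per character pattern, keyed by a document ID; the names `find_word`, `search`, and `store` are hypothetical, and the matching here is exact over candidates rather than the grammar-driven DP of this patent.

```python
def find_word(word, candidates):
    """Return start positions where every character of word is among the candidates."""
    hits = []
    for start in range(len(candidates) - len(word) + 1):
        if all(ch in candidates[start + k] for k, ch in enumerate(word)):
            hits.append(start)
    return hits


def search(word, store):
    """Search every stored document; store maps a document ID to its candidates."""
    results = {}
    for doc_id, candidates in store.items():
        hits = find_word(word, candidates)
        if hits:
            results[doc_id] = hits
    return results


# Hypothetical stored data: the recognizer was unsure about some characters.
store = {
    "DOC-0001": [{"t", "f"}, {"a", "o"}, {"x", "k"}],
    "DOC-0002": [{"f"}, {"e"}, {"e"}],
}
print(search("tax", store))
print(search("fee", store))
```

Because the stored data is never modified by a search, the same store can be queried with new words or rules without running the OCR again.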

FIG. 1: Processing flow diagram of character string recognition.
FIG. 2: Conceptual diagram of notation knowledge processing using a character string hypothesis.
FIG. 3: Conceptual diagram of a character string hypothesis.
FIG. 4: Notation knowledge processing as an optimal cost path search problem.
FIG. 5: Calculation timing table for grammar-driven DP type notation knowledge processing.
FIG. 6: Grammar-driven DP type notation knowledge processing based on a specific example.
FIG. 7: Stack state transition diagram in grammar-driven DP type notation knowledge processing.
FIG. 8: Configuration example of an OCR device and a document processing device.

Explanation of symbols

0101 ... image input unit, 0102 ... document structure analysis unit, 0103 ... character line extraction unit, 0104 ... character string hypothesis creation unit, 0105 ... grammar-driven DP type character string notation analysis unit, 0106 ... text output unit, 0101 ... paper document input to a conventional document processing system
0301 ... character pattern on the character string hypothesis, 0302 ... pattern boundary on the character string hypothesis, 0303 ... character identification result on the character string hypothesis, 0304 ... character identification similarity on the character string hypothesis, 0305 ... word retrieved from the character string hypothesis
0801 ... image input device in the OCR device unit, 0802 ... operation terminal device in the OCR device unit, 0803 ... display terminal device in the OCR device unit, 0804 ... external storage device in the OCR device unit, 0805 ... memory in the OCR device unit, 0806 ... CPU in the OCR device unit, 0807 ... communication device in the OCR device unit, 0808 ... communication bus in the OCR device unit, 0809 ... network unit, 0810 ... operation terminal device in the document image processing device unit, 0811 ... display terminal device in the document image processing device unit, 0812 ... external storage device in the document image processing device unit, 0813 ... memory in the document image processing device unit, 0814 ... CPU in the document image processing device unit, 0815 ... communication device in the document image processing device unit, 0816 ... communication bus in the document image processing device unit.

Claims

A character recognition device comprising an image input unit, a character string hypothesis creation unit, a character string notation analysis unit, and a notation knowledge dictionary storage unit, wherein
the character string hypothesis creation unit outputs, based on an image input to the image input unit, a character string hypothesis that includes the uncertainty of character extraction and character recognition, and
the character string notation analysis unit performs character string recognition by analyzing the character string hypothesis with a dynamic programming algorithm, using character string notation knowledge in which a plurality of character strings stored in the notation knowledge dictionary storage unit are expressed collectively by a grammar denoting parallel, set, or omission.

The character string recognition device according to claim 1, wherein,
in the process of interpreting the character string notation grammar, the character string notation analysis unit sequentially expands the notation analysis result of the character string hypothesis, using a stack, for each character of the character string notation knowledge, thereby reducing the amount of computation required for cost calculation.

A character string recognition method in a character recognition device having an image input unit, a character string hypothesis creation unit, a character string notation analysis unit, and a notation knowledge dictionary storage unit, the method comprising:
a first step of outputting, based on an image input to the image input unit, a character string hypothesis that includes the uncertainty of character extraction and character recognition; and
a second step of analyzing the character string hypothesis with a dynamic programming algorithm, using character string notation knowledge in which a plurality of character strings stored in the notation knowledge dictionary storage unit are expressed collectively by a grammar denoting parallel, set, and omission.

The character string recognition method according to claim 3, wherein,
in the process of interpreting the character string notation grammar, the second step sequentially expands the notation analysis result of the character string hypothesis, using a stack, for each character of the character string notation knowledge, thereby reducing the amount of computation required for cost calculation.