JP2003515229A

JP2003515229A - Symbol classification with shape features given to neural networks

Info

Publication number: JP2003515229A
Application number: JP2001539231A
Authority: JP
Inventors: Lalitha Agnihotri; Nevenka Dimitrova
Original assignee: Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 1999-11-17
Filing date: 2000-11-02
Publication date: 2003-04-22
Also published as: KR20010110415A; WO2001037211A1; EP1147484A1

Abstract

(57) Abstract: An image processing device and method for classifying symbols, such as text, in a video stream employ a back-propagation neural network (BPNN) whose feature space is derived from size-, translation-, and rotation-invariant shape-dependent features. Various example feature spaces are discussed, such as regular and invariant moments and an angle histogram derived from a Delaunay triangulation of a thinned, thresholded symbol. Such feature spaces are a good match for a BPNN classifier because of the poor resolution of characters in video streams.

Description

Detailed Description of the Invention

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention is related to that disclosed in U.S. Patent Application Serial No. 09/370,931, entitled "System and Method for Analyzing Video Content Using Detected Text in Video Frames," filed August 9, 1999. That application is commonly assigned to the assignee of the present invention and is hereby incorporated by reference in its entirety as if fully set forth herein. The present invention is also related to that disclosed in U.S. Provisional Patent Application No. 60/117,658, entitled "Method and Apparatus for Detection and Localization of Text in Video," filed January 28, 1999. That provisional application is commonly assigned to the assignee of the present invention, and its disclosure is hereby incorporated by reference for all purposes as if fully set forth herein. The present invention is further related to that disclosed in the application entitled "Symbol Classification with Shape Features Applied to Neural Network," filed concurrently with the present application. That application is commonly assigned to the assignee of the present invention, and its disclosure is hereby incorporated by reference for all purposes as if fully set forth herein.

[0002]

BACKGROUND OF THE INVENTION

The present invention relates to systems that recognize patterns in digitized images and, more particularly, to systems that isolate symbols such as text characters in video data streams.

[0003] Real-time broadcast, analog tape, and digital video are important for education, entertainment, and a host of multimedia applications. With video collections running to millions of hours, technology that interprets video data is needed so that this material can be used and accessed more effectively. Various such enhanced uses have been proposed. For example, text and speech recognition can lead to the creation of a synopsis of an original video and to the automatic generation of keys for indexing video content. Another range of applications relies on rapid real-time classification of text and/or other symbols in broadcast (or multicast, etc.) video data streams. For example, text recognition can be used for any suitable purpose, such as indexing video content.

[0004] Various text recognition techniques have been used to recognize digitized patterns. The most common example is optical character recognition (OCR) of documents. The general model for these techniques is that an input vector is derived from an image, the input vector characterizing the raw pattern. The vector is mapped to one of a fixed number or range of symbol classes to "recognize" the image. For example, the pixel values of a bitmap image may serve as the input vector, and the corresponding set of classes may be an alphabet, such as the English alphabet. No particular pattern recognition technique has achieved universal dominance. Each recognition problem has its own set of application difficulties: the size of the class set, the size of the input vector, the required speed and accuracy, and other issues. Reliability, too, is an area that cries out for improvement in nearly every field of application.

[0005] As a result of the foregoing shortcomings, pattern recognition is a field of continuous active development, with the various applications receiving varying degrees of attention based on their perceived merits, such as utility and practicability. Perhaps the most mature of these technologies is the application of pattern recognition to text characters, that is, optical character recognition (OCR). This technology has developed because of the desirability and practicality of converting printed matter into computer-readable characters. From a practical standpoint, printed documents offer a relatively clear and consistent data source. Such documents are generally characterized by high-contrast patterns set against a uniform background and can be stored at high resolution. For example, a printed document may be scanned at an arbitrary resolution to form a binary image of the printed characters. There is also a clear need for this application of pattern recognition: converting documents to computer-based text avoids the labor of keyboard transcription, realizes economies in data storage, and permits documents to be searched.

[0006] Some application areas have received scant attention because of the attendant difficulty of performing symbol or character classification. For example, the recognition of patterns in video streams is a difficult area for at least the following reason: characters in a video stream tend to be presented against spatially non-uniform (and sometimes temporally variable) backgrounds, at poor resolution, and with low contrast. Recognizing characters in a video stream is therefore difficult, and no reliable method is known. In addition, for some applications, as disclosed in the related applications mentioned above, fast recognition speed at the least is highly desirable.

[0007] Systems and methods for indexing and classifying video have been described in numerous publications, including: M. Abdel-Mottaleb et al., "CONIVAS: Content-based Image and Video Access System," Proceedings of ACM Multimedia, pp. 427-428, Boston (1996); S-F. Chang et al., "VideoQ: An Automated Content Based Video Search System Using Visual Cues," Proceedings of ACM Multimedia, pp. 313-324, Seattle (1994); M. Christel et al., "Informedia Digital Video Library," Comm. of the ACM, Vol. 38, No. 4, pp. 57-58 (1995); N. Dimitrova et al., "Video Content Management in Consumer Devices," IEEE Transactions on Knowledge and Data Engineering (November 1998); U. Gargi et al., "Indexing Text Events in Digital Video Databases," International Conference on Pattern Recognition, Brisbane, pp. 916-918 (August 1998); M.K. Mandal et al., "Image Indexing Using Moments and Wavelets," IEEE Transactions on Consumer Electronics, Vol. 42, No. 3 (August 1996); and S. Pfeiffer et al., "Abstracting Digital Movies Automatically," Journal on Visual Communications and Image Representation, Vol. 7, No. 4, pp. 345-353 (1996).

[0008] The extraction of characters by a method that uses local thresholding and detects image regions containing characters by evaluating gray-level differences between adjacent regions is described in Ohya et al., "Recognizing Characters in Scene Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, pp. 214-224 (February 1994). Ohya et al. further disclose merging detected regions that are in close proximity and have similar gray levels in order to generate character pattern candidates.

[0009] Using the spatial context and high-contrast characteristics of video text to merge regions in which horizontal and vertical edges lie in close proximity to one another, in order to detect text, is described in A. Hauptmann et al., "Text, Speech, and Vision for Video Segmentation: The Informedia Project," AAAI Fall 1995 Symposium on Computational Models for Integrating Language and Vision (1995). R. Lienhart and F. Suber discuss a non-linear color system for reducing the number of colors in a video image in "Automatic Text Recognition for Video Indexing," SPIE Conference on Image and Video Processing (January 1996). That reference describes a split-and-merge process for producing homogeneous segments of similar color. Lienhart and Suber use various heuristics to detect characters in the homogeneous regions, including that the characters are in the foreground, are monochrome and rigid, are restricted in size, and have high contrast relative to the surrounding regions.

[0010] The use of multi-valued image decomposition to locate text and to separate an image into multiple real foreground and background images is described in A.K. Jain and B. Yu, "Automatic Text Location in Images and Video Frames," Proceedings of IEEE Pattern Recognition, Vol. 31, pp. 2055-2076 (1998). J-C. Shim et al., in "Automatic Text Extraction from Video for Content-Based Annotation and Retrieval," Proceedings of the International Conference on Pattern Recognition, pp. 618-620 (1998), describe the use of a generalized region-labeling algorithm to find homogeneous regions and to segment and extract text. The identified foreground images are clustered to determine the color and location of the text.

[0011] Other useful algorithms for image segmentation are described by K.V. Mardia et al. in "A Spatial Thresholding Method for Image Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, pp. 919-927 (1988), and by A. Perez et al. in "An Iterative Thresholding Method for Image Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 9, pp. 742-751 (1987).

[0012] Various methods for locating text in a digitized bitmap are known. Also known are techniques for binarizing character data to form an image that can be characterized as black-on-white, and for performing character recognition on bitmap images. Text and other patterns in video streams range from the predictable, large, and clear, which are easy to classify, to the crude, fleeting, and unpredictably oriented and positioned, which contain too little information, even in principle, to be classified without the help of auxiliary contextual data. Research aimed at increasing recognition speed and accuracy is also ongoing. There is therefore room for improvement in the current state of the art, particularly where applications such as video stream data strain current technology.

[0013]

DISCLOSURE OF THE INVENTION

Briefly, an image processing device and method for classifying symbols, such as text, in a video stream employ a back-propagation neural network (BPNN) whose feature space is derived from size-, translation-, and rotation-invariant shape-dependent features. Generally, the feature space is the space occupied by the symbol to be recognized. The shape-dependent features may be straight-line portions of the symbol. Using the shape-dependent features in this feature space, a histogram of angles is formed and used as the feature vector for the symbol concerned. This feature vector is applied to the BPNN, which generates various candidate symbols. While a BPNN is used in the present invention, the BPNN and its training employ known techniques, as described, for example, in U.S. Patent No. 4,912,654. Various example feature spaces are discussed, such as regular and invariant moments and an angle histogram derived from a Delaunay triangulation of a thinned, thresholded symbol. Such feature spaces are a good match for a BPNN classifier because of the poor resolution of characters in video streams. The shape-dependent feature spaces are made practicable by the accurate isolation of character regions using the techniques described in the present application.
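The angle-histogram feature described above can be illustrated with a minimal sketch. It assumes the triangulation of the thinned symbol is already available as a list of triangles; computing the Delaunay triangulation itself is not shown. The function names, the choice of nine 20-degree bins, and the sample triangles are illustrative assumptions, not details taken from the patent.

```python
import math

def triangle_angles(a, b, c):
    """Return the three interior angles (in degrees) of triangle abc."""
    def angle_at(p, q, r):
        # Angle at vertex p between the rays p->q and p->r.
        ux, uy = q[0] - p[0], q[1] - p[1]
        vx, vy = r[0] - p[0], r[1] - p[1]
        cross = ux * vy - uy * vx
        dot = ux * vx + uy * vy
        return math.degrees(math.atan2(abs(cross), dot))
    return [angle_at(a, b, c), angle_at(b, c, a), angle_at(c, a, b)]

def angle_histogram(triangles, bins=9):
    """Normalized histogram of interior angles over 0..180 degrees.

    The bin count is a hypothetical choice; the result is the kind of
    fixed-length vector that could be fed to a classifier.
    """
    counts = [0] * bins
    width = 180.0 / bins
    for a, b, c in triangles:
        for angle in triangle_angles(a, b, c):
            counts[min(int(angle // width), bins - 1)] += 1
    total = sum(counts)
    return [n / total for n in counts] if total else [0.0] * bins

def similarity_transform(point, scale, theta, dx, dy):
    """Scale, rotate by theta radians, then translate a 2-D point."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (scale * (c * x - s * y) + dx, scale * (s * x + c * y) + dy)

# Two hand-picked triangles standing in for a triangulated symbol.
triangles = [((0, 0), (4, 0), (0, 3)), ((0, 0), (2, 0), (1, 3))]
# The same triangles after resizing, rotating, and shifting the symbol.
moved = [tuple(similarity_transform(p, 2.5, 0.7, 10, -4) for p in tri)
         for tri in triangles]

original_hist = angle_histogram(triangles)
moved_hist = angle_histogram(moved)
print(original_hist == moved_hist)
```

Interior angles do not change under scaling, rotation, or translation, which is why a histogram of them is a plausible size-, translation-, and rotation-invariant input for the BPNN. Angles that fall very close to a bin boundary could land in different bins after a transform because of rounding, so a practical implementation might smooth the histogram or choose bins with that in mind.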

[0014] The ability to detect and classify text appearing in video streams has many uses. For example, video sequences and portions thereof may be characterized and indexed according to classifications derived from such text. This could lead to indexing, enhanced search capabilities, annotation features, and the like. In addition, recognition of text in a video stream can permit the presentation of context-sensitive features, such as an invocable link to a web site generated in response to the appearance of a web address in a broadcast video stream.

[0015] Text in video presents a very different problem set from that of document OCR, which is a well-developed but still maturing technology. Text in documents tends to be uni-colored and of high quality. In video, scaled-down scene images may contain noise and uncontrolled illumination. Characters appearing in video can be of varying color, size, font, orientation, and thickness; backgrounds can be complex and temporally variant; and so on. Moreover, many applications of video symbol recognition require high speed.

[0016] The technique employed by the invention for classifying video text uses an accurate, high-speed technique for symbol isolation. The symbol bitmap is then used to generate a shape-dependent feature vector, which is applied to a BPNN. The feature vector places greater emphasis on overall image shape while being relatively insensitive to the variability problems identified above. In the technique for isolating character regions, connected component structures are defined on the basis of the detected edges. Because edge detection yields far fewer pixels overall than binarizing the entire field occupied by a symbol, the process of generating connected components can be much faster. The choice of feature space also improves recognition speed. With a simulated BPNN, the size of the input vector can seriously affect throughput. It is very important to be selective about the components used from the selected feature space. Of course, a heterogeneous feature space may be formed by combining mixes of different features, such as moments and line-segment features. Computational economies may also be realized where the selected features share computational steps.
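As a rough illustration of defining connected components on detected edges, the sketch below groups edge pixels of a small binary edge map into 8-connected components and computes a bounding box for each. The edge map is assumed to be given (the edge detector itself is not shown), and the function names, the use of 8-connectivity, and the sample data are illustrative assumptions rather than the patent's specific procedure.

```python
from collections import deque

def connected_components(edge_map):
    """Group nonzero pixels of a non-empty 2-D grid into 8-connected components.

    Returns a list of components, each a list of (row, col) pixels,
    in the order they are first met during a row-major scan.
    """
    rows, cols = len(edge_map), len(edge_map[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not edge_map[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited edge pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and edge_map[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return (min(ys), min(xs), max(ys), max(xs))

# Hypothetical edge map: 1 marks an edge pixel, 0 marks everything else.
edges = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
]
components = connected_components(edges)
boxes = [bounding_box(p) for p in components]
print(len(components), boxes)
```

Only the edge pixels are ever queued, so the work scales with the number of edge pixels plus one scan of the grid, which is consistent with the speed argument made in the paragraph above.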

[0017] The invention will now be described in connection with certain preferred embodiments, with reference to the accompanying drawings, so that it may be more fully understood. With reference to the figures, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the invention only, and are presented in order to provide what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.

[0018]

DETAILED DESCRIPTION OF THE INVENTION

Referring to FIG. 1, an image text analysis system 100 employs a video processing device 110, a video source 180, and possibly a monitor 185 to receive video input and to generate and store character information embedded in it. The video processing device 110 receives video images, parses frames, isolates text areas and character regions, and classifies the text and/or character regions according to procedures discussed in detail below. Video is supplied from the video source 180. The video source 180 can be any source of video data, including a VCR with an analog-to-digital converter (ADC), a disk with digitized video, a cable box with an ADC, a DVD or CD-ROM drive, a digital video home system (DVHS), a digital video recorder (DVR), a hard disk drive (HDD), and so on. The video source 180 may be capable of providing a few short clips or multiple clips, including longer-length digitized video images. The video source 180 may provide video data in any analog or digital format, such as MPEG-2 or MJPEG.

[0019] The video processing device 110 may include an image processor 120, RAM 130, storage 140, a user I/O card 150, a video card 160, an I/O buffer 170, and a processor bus 175. The processor bus 175 transfers data between the various elements of the video processing device 110. The RAM 130 further comprises an image text workspace 132 and a text analysis controller 134. The image processor 120 provides overall control of the video processing device 110 and performs the image processing required for the image text analysis system 100, including analyzing text in video frames based on system-selected and user-selected attributes. This also includes implementing editing processes, processing digitized video images for display on the monitor 185 and/or storage in the storage 140, and transferring data between the various elements of the image text analysis system 100. The requirements and capabilities of the image processor 120 are well known in the art and need not be described in greater detail other than as required for the present invention.

[0020] The image processor 120 may be implemented as a suitably programmed computer circuit. A set of instructions loaded into a program memory causes the computer circuit to carry out the operations described below with reference to FIGS. 2 through 11B. The set of instructions may be loaded into the program memory by reading a carrier, such as a disk, that contains the set of instructions. The reading may also take place via a communication network such as the Internet. In that case, a service provider makes the set of instructions available to interested parties.

[0021] The RAM 130 provides random access memory for temporary storage of data produced by the video processing device 110 that is not otherwise provided by components within the system. The RAM 130 includes memory for the image text workspace 132 and the text analysis controller 134, as well as other memory required by the image processor 120 and associated devices. The image text workspace 132 represents the portion of the RAM 130 in which video images associated with a particular video clip are temporarily stored during the text analysis process. The image text workspace 132 allows copies of frames to be modified without affecting the original data, so that the original data can later be recovered.

[0022] In one embodiment of the invention, the text analysis controller 134 represents the portion of the RAM 130 dedicated to storing an application program, executed by the image processor 120, that analyzes video images on the basis of system-defined or user-defined text attributes. The text analysis controller 134 may execute well-known editing techniques, such as morphing or boundary detection between scenes, as well as the novel techniques for video text recognition associated with the present invention. The text analysis controller 134 may also be embodied as a program on a CD-ROM, computer diskette, or other storage medium that can be loaded into a removable disk port in the storage 140 or elsewhere, such as in the video source 180.

[0023] The storage 140 comprises one or more disk systems, including removable disks (magnetic or optical), for permanent storage of programs and other data, including the required video and audio data. Depending on system requirements, the storage 140 may be configured to interface with one or more bidirectional buses for the transfer of video and audio data to and from the video source or sources and the rest of the system. The storage 140 is capable of transferring data at video rates, as required. The storage 140 is sized to provide adequate storage for several minutes of video for editing purposes, including text attribute analysis. Depending on the specific application and the capability of the image processor 120, the storage 140 may be configured to store a large number of video clips.

[0024] The user I/O card 150 may interface various user devices (not shown) to the rest of the image text analysis system 100. The user I/O card 150 converts data received from the user devices to the format of the interface bus 175 for transfer to the image processor 120, or to the RAM 130 for subsequent access by the image processor 120. The user I/O card 150 also transfers data to user output devices such as printers (not shown). The video card 160 provides an interface between the monitor 185 and the rest of the video processing device 110 through the data bus 175.

[0025] The I/O buffer 170 interfaces between the video source 180 and the rest of the image text analysis system 100 through the bus 175. As previously discussed, the video source 180 has at least one bidirectional bus for interfacing with the I/O buffer 170. The I/O buffer 170 transfers data to and from the video source 180 at the required video image transfer rate. Within the video processing device 110, the I/O buffer 170 transfers data received from the video source 180 to the storage 140, to the image processor 120, or to the RAM 130, as required. Simultaneous transfer of video data to the image processor 120 provides a means of displaying the video images as they are received.

【００２６】図２、３Ａ及び３Ｂを参照すると、（図２に概要を示すような）テキスト抽出
及び認識処理１００をビデオ処理装置１００又は何れかの他の適切な装置により
図３Ａ及び３Ｂに示すもののようなテキストを含むビデオシーケンスに対して実
行することができる。個々のフレーム３０５に図２に概要を示すような手順を施
し、結果として３１０、３１５、３６０、３６５、３７０及び３７５のような個
々のテキスト領域が分離される。該手順は、背景の複雑さを低減しテキストの明
瞭さを増加させるよう積分された複数フレームの積分に対して適用することもで
きることに注意すべきである。即ち、多数の連続するフレームが同一のテキスト
領域を含む場合（そして、これは上記テキスト領域が同様なスペクトル密度関数
のような略同一の信号特性を含む場合に識別することができる）、複数の連続す
るフレームは積分する（例えば、平均化する）ことができる。これは、テキスト
領域を一層明瞭にし、テキストが背景に対して一層良好に区別されるようにする
傾向がある。背景が動画である場合、この手順により該背景の複雑さは必然的に
低減される。このような信号平均化の利点の幾つかは、最近のテレビジョンにお
けるように動画の向上のために時間積分がなされるようなソースからも得ること
もできることに注意されたい。このように、以下の説明に関しては、“単一の”
フレームに対する処理の概念は、決して単一の“フレームの掴み”に限定される
ものではなく、画像解析がなされる“フレーム”は、１以上の連続したビデオフ
レームの複合とすることができる。Referring to FIGS. 2, 3A and 3B, the text extraction and recognition process 100 (outlined in FIG. 2) can be performed by the video processing device 100, or any other suitable device, on a video sequence containing text such as that shown in FIGS. 3A and 3B. Each individual frame 305 is subjected to the procedure outlined in FIG. 2, which isolates individual text regions such as 310, 315, 360, 365, 370 and 375. Note that the procedure can also be applied to a composite of multiple frames that have been integrated to reduce background complexity and increase the clarity of the text. That is, if a number of consecutive frames contain the same text region (which can be identified when the text regions have substantially the same signal characteristics, such as similar spectral density functions), the consecutive frames can be integrated (for example, averaged). This tends to make the text regions clearer and to set the text off better against the background. If the background is a moving image, this procedure necessarily reduces the complexity of the background. Note that some of the benefits of such signal averaging can also be obtained from sources in which time integration is performed to enhance moving images, as in modern televisions. Thus, in the discussion that follows, the notion of processing a "single" frame is by no means limited to a single "frame grab"; the "frame" on which image analysis is performed can be a composite of one or more consecutive video frames.
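The frame integration described above can be illustrated with a minimal Python sketch. It assumes that frames are equally sized grayscale images stored as nested lists; the function name `average_frames` and the sample values are hypothetical and are not taken from the patent.

```python
def average_frames(frames):
    """Average equally sized grayscale frames pixel by pixel."""
    if not frames:
        raise ValueError("at least one frame is required")
    height, width = len(frames[0]), len(frames[0][0])
    count = len(frames)
    return [
        [sum(frame[y][x] for frame in frames) / count for x in range(width)]
        for y in range(height)
    ]


# Three one-row frames: the text pixel (200) is static, the background pixel moves.
frames = [
    [[200, 10]],
    [[200, 90]],
    [[200, 50]],
]
composite = average_frames(frames)  # text stays at 200, background settles at its mean
```

The static text keeps its value while the varying background is pulled toward its mean, which is the sense in which averaging simplifies a moving background.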

【００２７】最初に、画像プロセッサ１２０はビデオ画像の１以上のフレームのカラーを分
離し、テキストの抽出に使用するために低減されたカラーの画像を記憶する。本
発明の一実施例においては、画像プロセッサ１２０は赤-緑-青（ＲＧＢ）カラー
空間モデルを使用してピクセルの赤成分を分離する。フレームのテキスト部分が
どの様に見えるかの一例が、図４Ａに示されている。赤成分は、たいていの場合
、ビデオテキストに対して主に使用される白、黄色及び黒のカラーを検出するの
に最も有効である。即ち、重ねられた（スーパーインポーズされた）テキストに
対しては、分離された赤のフレームは共通のテキストカラーに対してシャープな
、高コントラストのエッジをもたらす。この方法は、ビデオ上に重ねられたもの
ではなく、掲示板又は通りの看板を強調する映画の場面のように実際にビデオの
一部であるようなテキストを抽出するのにも使用することができる。このような
場合、赤のフレームは使用するのに最良のものではないであろう。斯様な場合は
、グレイスケール（アルファチャンネル）が最良のスタート点を提供するであろ
う。本発明の他の実施例においては、画像プロセッサ１２０は、グレイスケール
画像、又はＹＩＱビデオフレームのＹ成分等の種々のカラー空間モデルを使用す
ることができることに注意されたい。First, the image processor 120 separates the colors of one or more frames of the video image and stores a reduced-color image for use in extracting text. In one embodiment of the invention, the image processor 120 uses a red-green-blue (RGB) color space model to isolate the red component of the pixels. An example of how the text portion of a frame might look is shown in FIG. 4A. The red component is, in most cases, the most effective for detecting the white, yellow and black colors that are predominantly used for video text. That is, for overlaid (superimposed) text, the isolated red frame provides sharp, high-contrast edges for the common text colors. The method can also be used to extract text that is not overlaid on the video but is actually part of the video, such as a movie scene that dwells on a billboard or a street sign. In such cases the red frame may not be the best choice, and gray scale (alpha channel) may provide the best starting point. Note that in other embodiments of the invention the image processor 120 may use various color space models, such as a gray-scale image or the Y component of a YIQ video frame.
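As a small illustration of the color reduction step, the following Python sketch extracts either the red component or the Y component of YIQ from an RGB image. The image layout (rows of `(r, g, b)` tuples) and the function names are assumptions made for this example; the Y weights are the standard YIQ luma coefficients.

```python
def red_channel(rgb_image):
    """Keep only the red component of each (r, g, b) pixel."""
    return [[pixel[0] for pixel in row] for row in rgb_image]


def luma_channel(rgb_image):
    """Y component of YIQ: 0.299 R + 0.587 G + 0.114 B."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


# White text, yellow text, and a blue background pixel.
row = [[(255, 255, 255), (255, 255, 0), (0, 0, 255)]]
red = red_channel(row)    # white and yellow both map to 255, blue maps to 0
luma = luma_channel(row)
```

In this toy row, white and yellow text are indistinguishable in the red channel and both contrast fully with the blue background, which is consistent with the preference for the red frame for superimposed text.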

【００２８】分離されたフレーム画像は画像テキストワークスペース１３２に記憶される。
次いでステップＳ２１０においては、更なる処理がなされる前に、捕捉された画
像が鮮鋭化される。例えば、下記の３ｘ３マスクを使用することができ、 −１ −１ −１ −１８ −１ −１ −１ ―１この場合において、各ピクセルは自身の８倍に、隣接する各ピクセルの負を加え
たものとなる。ビットマップフィルタ（又は“マスク”）に関する上記のマトリ
クス表現は、当業技術における普通の表記である。当業技術において既知の多く
の斯様な派生的フィルタが存在し、本発明はテキスト領域を分離するために、種
々の異なる技術の何れかをしようとするものである。上述したものは、非常に簡
単な例に過ぎない。当該フィルタ処理ステップは、例えば或る次元に沿う階調の
検出に、他の次元に沿う階調の検出（同時に各直交方向において平滑化をする）
が後続し、更にこれら２つのフィルタ処理の結果の加算が後続するような、複数
の過程を含むことができる。ステップＳ２１０においては、例えば１９９２年の
アジソン−ウェズレイ出版会社の“デジタル画像処理”にR.C. Gonzalez及びR.E
. Woodsにより記載されたようなメジアンフィルタを用いて、ランダム雑音を低
減することができる。The separated frame image is stored in the image text workspace 132. Then, in step S210, the captured image is sharpened before further processing. For example, the following 3x3 mask can be used:

-1 -1 -1
-1  8 -1
-1 -1 -1

With this mask, each pixel becomes eight times itself plus the negative of each of its neighboring pixels. The above matrix representation for bitmap filters (or "masks") is common notation in the art. Many such derivative filters are known in the art, and the invention contemplates the use of any of a variety of different techniques for isolating text regions. The above is merely a very simple example. The filtering step can include multiple passes, for example gradient detection along one dimension followed by gradient detection along the other dimension (while simultaneously smoothing in the respective orthogonal directions), followed by addition of the two filtering results. In step S210, random noise can be reduced using, for example, a median filter as described by R.C. Gonzalez and R.E. Woods in "Digital Image Processing", Addison-Wesley Publishing Company, 1992.
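A minimal Python sketch of applying a 3x3 mask such as the sharpening mask (center weight 8, all neighbors -1) follows. The function name `apply_mask` and the choice to set border pixels to 0 are assumptions for illustration only; they are not prescribed by the text.

```python
SHARPEN_MASK = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]


def apply_mask(image, mask):
    """Apply a 3x3 mask to every interior pixel; border pixels are set to 0."""
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            result[y][x] = sum(
                mask[dy + 1][dx + 1] * image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
    return result


flat = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
spot = [[10, 10, 10], [10, 50, 10], [10, 10, 10]]
flat_center = apply_mask(flat, SHARPEN_MASK)[1][1]  # 8*10 - 8*10
spot_center = apply_mask(spot, SHARPEN_MASK)[1][1]  # 8*50 - 8*10
```

A uniform area produces no response, while a pixel that differs from its neighbors is strongly amplified. The same function accepts any other 3x3 mask.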

【００２９】エッジ検出は、他のエッジフィルタを使用することができる。このフィルタを
介して、鮮鋭化された（赤、グレイスケール等の）画像におけるエッジを増幅す
ることができ（好ましくは、増幅され）、非エッジは例えば下記のエッジマスク
を用いて減衰されるが、 −１ −１ −１ −１１２ −１ −１ −１ −１ここでも、各ピクセルは自身及び隣接するピクセルに適用された上記各係数（重
み）の和である。図４Ｃには、上記フィルタ処理ステップの結果が図示されてい
る。元の画像１６３はエッジフィルタ処理されて微分画像１６４が得られ、次い
で該画像がエッジ強調されて最終画像１６５が得られ、該最終画像に以下のフィ
ルタ処理が施される。Another edge filter can be used for edge detection. With this filter, edges in the sharpened (red, gray-scale, etc.) image can be (and preferably are) amplified and non-edges attenuated, for example using the following edge mask:

-1 -1 -1
-1 12 -1
-1 -1 -1

Here too, each pixel is the sum of the above coefficients (weights) applied to itself and its neighboring pixels. The results of the preceding filtering steps are illustrated in FIG. 4C. The original image 163 is edge-filtered to produce the differential image 164, which is then edge-enhanced to produce the final image 165, and the final image is subjected to the following filtering.

【００３０】ステップＳ２１５において、しきいエッジフィルタ、即ち“エッジ検出器”が
適用される。Ｅｄｇｅ_ｍ，ｎがＭｘＮのエッジ画像のｍ，ｎピクセルを表し、Ｆ _ｍ，ｎがステップＳ２１０の結果としての向上された画像を表すとすると、エッ
ジ検出には下記の式を使用することができる。式１[0030] In step S215, a threshold edge filter, or "edge detector", is applied. If Edge_{m,n} represents the (m, n) pixel of an M×N edge image and F_{m,n} represents the enhanced image resulting from step S210, the following formula can be used for edge detection: Formula 1

【式１】ここで、０＜ｍ＜Ｍ及び０＜ｎ＜Ｎであり、Ｌ_edgeは一定であるか又は一定でな
いしきい値である。値ｗ_i,jは前記エッジマスクからの重みである。当該エッジ
検出処理においては最も外側のピクセルは無視することができる。ここでも、こ
のしきい処理に鮮鋭化フィルタを暗黙的に適用することができることに注意すべ
きである。[Formula 1] Here, 0 < m < M and 0 < n < N, and L_{edge} is a threshold value that may or may not be constant. The values w_{i,j} are the weights from the edge mask. The outermost pixels can be ignored in the edge detection process. Note again that the sharpening filter can be applied implicitly in this thresholding operation.

【００３１】エッジしきい値Ｌ_edgeは所定のしきい値であり、固定値又は可変値とすること
ができる。固定しきい値を使用すると、結果としてごま塩雑音が生じ、テキスト
の周辺の固定エッジに不連続を生じる。既知の開口処理（例えば、膨張処理が後
続する浸食処理）は、結果としてテキストの一部の欠損につながる。適応的しき
いエッジフィルタ、即ち可変しきい値を伴うフィルタ、は斯かる傾向を改善し、
静止しきい値の使用よりも大きな改善となる。The edge threshold L_{edge} is a predetermined threshold that can be fixed or variable. Using a fixed threshold results in salt-and-pepper noise and produces discontinuities in the fixed edges around the text. Known opening methods (for example, erosion followed by dilation) result in the loss of parts of the text. An adaptive threshold edge filter, that is, one with a variable threshold, ameliorates these tendencies and is a substantial improvement over the use of a static threshold.

【００３２】ステップＳ２２０においては、上記エッジ検出しきい値を調整する或るモード
において、上記エッジ検出器を用いて第１固定しきい値が適用された後、上記固
定しきい値ステップにおいて識別されたエッジピクセルに（特定の許容誤差内で
）隣接する如何なるピクセルに対しても局部的なしきい値が低下され、当該フィ
ルタが再適用される。他のモードにおいては、後者の効果は、上記しきい値ステ
ップの結果に対して平滑化関数を適用し（結果が２より大きなピクセル深度で記
憶されると仮定する）、次いでしきい処理を再び実行することにより容易に達成
することができる。これは、非エッジとして印されたピクセルが、エッジとして
印されるようにする。ピクセルに対するしきい値低下の程度は、好ましくは、エ
ッジとして印された隣接ピクセルの数に依存するようにする。これの背後にある
理由付けは、隣接するピクセルがエッジである場合、目下のピクセルがエッジで
ある可能性が高いということである。局部的なしきい値の低下から生じるエッジ
ピクセルは、隣接するピクセルに対する低減されたしきい値の計算には使用され
ない。In step S220, in one mode of adjusting the edge detection threshold, a first fixed threshold is applied using the edge detector; the local threshold is then lowered for any pixel adjacent (within a specified tolerance) to the edge pixels identified in the fixed threshold step, and the filter is reapplied. In another mode, the latter effect can be achieved just as easily by applying a smoothing function to the result of the threshold step (assuming the result is stored with a pixel depth greater than two) and then thresholding again. This causes pixels that were marked as non-edges to become marked as edges. The degree to which the threshold is lowered for a pixel preferably depends on the number of neighboring pixels marked as edges. The rationale is that when neighboring pixels are edges, the current pixel is more likely to be an edge. Edge pixels that result from lowering the local threshold are not used in calculating the reduced threshold for neighboring pixels.
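The first mode of step S220 can be sketched in Python as a two-pass threshold. This is an illustration under stated assumptions: `response` is the already filtered image, a pixel counts as an edge when its response exceeds the threshold, and the threshold is lowered linearly by `reduction_per_neighbor` for each first-pass edge among the eight neighbors. The names and the linear rule are hypothetical, not taken from the patent.

```python
def adaptive_edge_map(response, base_threshold, reduction_per_neighbor):
    """Fixed threshold first, then a locally lowered threshold near found edges."""
    height, width = len(response), len(response[0])
    # Pass 1: fixed threshold.
    strong = [[response[y][x] > base_threshold for x in range(width)]
              for y in range(height)]
    edges = [row[:] for row in strong]
    # Pass 2: only pass-1 edges count as neighbors, so edges created here
    # do not lower the threshold of other pixels.
    for y in range(height):
        for x in range(width):
            if strong[y][x]:
                continue
            neighbors = sum(
                strong[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            local_threshold = base_threshold - reduction_per_neighbor * neighbors
            if neighbors and response[y][x] > local_threshold:
                edges[y][x] = True
    return edges


response = [
    [100, 60, 100],
    [5, 5, 5],
]
edge_map = adaptive_edge_map(response, base_threshold=80, reduction_per_neighbor=15)
```

The middle pixel of the top row (60) fails the fixed threshold of 80 but passes the local threshold of 50 because two of its neighbors are first-pass edges, so the break between them is closed.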

【００３３】他の例として、固定しきい値をローパス重み付け関数と共に使用して、強いエ
ッジピクセル（高い階調を持つピクセル）により囲まれる単一の又は少数の非エ
ッジピクセルがエッジピクセルとして印されるのを保証するようにすることもで
きる。事実、上述した全てのステップＳ２１０ないしＳ２２０は、総和に対する
、式１の形ではあるが一層広い範囲を伴う単一の数値演算により記載することが
できる。これらの個別のステップへの分離は必要である又は限定するものである
とは見なしてはならず、計算装置及びソフトウェアの細目及び他の要件に依存し
得る。As another example, a fixed threshold can be used together with a low-pass weighting function to ensure that a single non-edge pixel, or a small number of them, surrounded by strong edge pixels (pixels with a high gradient) is marked as an edge pixel. In fact, all of steps S210 to S220 described above can be expressed as a single numerical operation in the form of Formula 1, but with wider ranges on the summations. The separation into distinct steps should not be regarded as necessary or limiting; it may depend on the particulars of the computing equipment and software as well as other considerations.

【００３４】一旦、文字のエッジが検出されると、画像プロセッサ１２０は、テキストを含
まないか又はテキストが信頼性を以って検出することができない画像領域を除去
するために、予備エッジフィルタ処理を実行する。例えば、極端に少ないエッジ
、非常に低いエッジ密度（単位面積当たりのエッジピクセルの数）又は低程度の
エッジピクセルの集合（即ち、これらは長い距離の構造を形成しない、例えば雑
音）を伴うフレームは、更なる処理から除外することができる。Once the character edges have been detected, the image processor 120 performs preliminary edge filtering to remove image regions that contain no text or in which text cannot be detected reliably. For example, frames with extremely few edges, a very low edge density (number of edge pixels per unit area) or a low degree of aggregation of edge pixels (that is, the edge pixels do not form long-range structures, as with noise) can be excluded from further processing.

【００３５】画像プロセッサ１２０は異なるレベルでエッジフィルタ処理を実行することが
できる。例えば、エッジフィルタ処理はフレームレベルで又は副フレームレベル
で実行することができる。上記フレームレベルにおいては、画像プロセッサ１２
０は、フレームの合理的なものよりも大きな部分がエッジからなるように見える
場合は該フレームを無視することができる。他の例として、スペクトル解析のよ
うなフィルタ処理関数を、当該フレームが過多なエッジを有しそうであるかを判
定するために適用することができる。このことは、フレーム内の高密度の強いエ
ッジの物体から生じ得る。該仮定は、過度に複雑なフレームは大きな割合の文字
でない細部を含み、これを文字分類を介してフィルタするのは不釣り合いに負担
となるということである。The image processor 120 can perform edge filtering at different levels. For example, edge filtering can be performed at the frame level or at the sub-frame level. At the frame level, the image processor 120 can ignore a frame if more than a reasonable fraction of the frame appears to consist of edges. Alternatively, a filter function such as spectral analysis can be applied to determine whether the frame is likely to have too many edges. This can result from a high density of strong-edged objects in the frame. The assumption is that an overly complex frame contains a high proportion of non-character detail, and that filtering it out through character classification would be disproportionately burdensome.

【００３６】フレームレベルのフィルタ処理が使用される場合、画像プロセッサ１２０は当
該画像フレーム内のエッジピクセルの数を決定するためにエッジカウンタを維持
する。しかしながら、これは、雑音の多い部分と明瞭なテキストの部分とを伴う
フレームのような明瞭なテキストを含むフレームを跳ばし及び無視することにつ
ながり得る。斯様な画像フレーム又は画像副フレームの除外を防止するために、
画像プロセッサ１２０は副フレームレベルでエッジフィルタ処理を実行すること
もできる。これを行うに、画像プロセッサ１２０はフレームを小さな領域に分割
することができる。これを達成するため、画像プロセッサ１２０は、例えばフレ
ームを３つの群のピクセル列と３つの群のピクセル行とに分割することができる
。When frame-level filtering is used, the image processor 120 maintains an edge counter to determine the number of edge pixels in the image frame. This, however, can lead to skipping and ignoring frames that contain intelligible text, such as frames with noisy portions as well as portions of intelligible text. To avoid excluding such image frames or sub-frames, the image processor 120 can also perform edge filtering at the sub-frame level. To do this, the image processor 120 divides the frame into smaller regions, for example by dividing the frame into three groups of pixel columns and three groups of pixel rows.

【００３７】次に、画像プロセッサ１２０は各副フレームにおけるエッジの数を決定すると
共に、それに応じて関連するカウンタを設定する。副フレームが所定の数より多
くのエッジを有する場合は、上記プロセッサは該副フレームを放棄することがで
きる。領域当たりの所定の最大エッジカウント値は、画像領域を処理するのに要
する時間量、又はピクセル密度に対するサイズが認識精度を所望な最小値より低
くする確率に応じて設定することができる。解釈不能と識別された領域により囲
まれる明瞭なテキストの小さな領域が失われることに対して保証するために、大
きな数の副フレームを使用することができる。Next, the image processor 120 determines the number of edges in each sub-frame and sets the associated counter accordingly. If a sub-frame has more than a predetermined number of edges, the processor can discard that sub-frame. The predetermined maximum edge count per region can be set according to the amount of time required to process the image region, or according to the probability that its size relative to the pixel density would push recognition accuracy below a desired minimum. A larger number of sub-frames can be used to guard against losing small regions of clear text that are surrounded by regions identified as uninterpretable.
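The sub-frame edge counting can be sketched in Python as follows. The sketch assumes a binary edge map stored as nested lists and a 3 by 3 grid as in the example from the text; the function names and the `max_edges` value are hypothetical.

```python
def subframe_edge_counts(edges, rows=3, cols=3):
    """Count edge pixels in each cell of a rows x cols grid over the frame."""
    height, width = len(edges), len(edges[0])
    counts = [[0] * cols for _ in range(rows)]
    for y in range(height):
        for x in range(width):
            if edges[y][x]:
                counts[y * rows // height][x * cols // width] += 1
    return counts


def keep_subframes(counts, max_edges):
    """True for sub-frames whose edge count does not exceed the maximum."""
    return [[count <= max_edges for count in row] for row in counts]


# 6x6 edge map: a dense 2x2 block in the top-left cell, one edge bottom-right.
edges = [[0] * 6 for _ in range(6)]
for y in (0, 1):
    for x in (0, 1):
        edges[y][x] = 1
edges[5][5] = 1

counts = subframe_edge_counts(edges)
kept = keep_subframes(counts, max_edges=3)
```

Only the crowded top-left sub-frame is dropped here; the rest of the frame, including the sub-frame with a single edge pixel, remains available for further processing.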

【００３８】次に、ステップＳ２２５において、画像プロセッサ１２０は先のステップにお
いて発生されたエッジに対して接続構成部分（ＣＣ）解析を実行する。この解析
は、特定の許容範囲内で連続する全てのエッジピクセルをまとめる。即ち、他の
エッジピクセルと隣接する又は斯かる他のエッジピクセルの或る距離内にある全
てのエッジピクセルは当該ピクセルと一緒に統合される。最終的に、この統合処
理は、各々がエッジピクセルの連続した又は略連続した集合を有するような構造
又は接続された構成部分を規定する。これの動機は、各テキスト文字領域は単一
のＣＣに対応すると仮定されるということである。上記の許容誤差は画像捕捉の
解像度、アップサンプリングの程度（元の画像からの補間により追加されたピク
セルの割合）又はダウンサンプリングの程度（元の画像から削除されたピクセル
の割合）に応じて如何なる好適な値にも設定することができる。Next, in step S225, the image processor 120 performs a connected component (CC) analysis on the edges generated in the previous step. This analysis groups all edge pixels that are contiguous within a specified tolerance. That is, every edge pixel that is adjacent to, or within a certain distance of, another edge pixel is merged with that pixel. Ultimately, this merging process defines structures, or connected components, each having a contiguous or nearly contiguous set of edge pixels. The motivation is that each text character region is assumed to correspond to a single CC. The tolerance can be set to any suitable value depending on the resolution of the image capture, the degree of upsampling (the proportion of pixels added by interpolation from the original image) or the degree of downsampling (the proportion of pixels removed from the original image).
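One way to group edge pixels that are contiguous within a tolerance is a breadth-first search over a square neighborhood. The Python sketch below is a generic connected component labeling routine, not the patent's implementation; the `tolerance` parameter (maximum row or column offset between merged pixels) is an assumption about how the tolerance could be expressed.

```python
from collections import deque


def connected_components(edges, tolerance=1):
    """Group edge pixels lying within `tolerance` rows and columns of each other."""
    height, width = len(edges), len(edges[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not edges[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            component = []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny in range(max(0, cy - tolerance), min(height, cy + tolerance + 1)):
                    for nx in range(max(0, cx - tolerance), min(width, cx + tolerance + 1)):
                        if edges[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


row = [[1, 1, 0, 1, 0, 0, 1]]
strict = connected_components(row, tolerance=1)  # only touching pixels merge
loose = connected_components(row, tolerance=2)   # a one-pixel gap is bridged
```

With a tolerance of 1 the row splits into three components; with a tolerance of 2 the one-pixel gap is bridged and two components remain.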

【００３９】次に図４Ｂを参照すると、連続した文字に対応するＣＣの間の不用意なギャッ
プ又は断続部が、固定しきい値によるエッジ検出の結果として現れ得る。例えば
、１７１及び１７２に示されるもののような断続部が発生する可能性がある。記
載されるエッジ検出法の使用は、斯様な断続されたＣＣ部分の統合を保証する助
けとなる。図５Ａ及び５Ｂの左側の文字におけるような断続部から初めて、当該
ＣＣ統合方法の結果、断続部１７４、１７５及び１７６における点がエッジ点と
して識別され、１８１及び１８２におけるような単一の接続構成部分構造に各々
統合されることになる。接続された領域における“間違った”断続部を閉鎖する
ことは、上述した特定の方法に加えて、種々の方法で達成することができること
に注意すべきである。例えば、膨張処理を浸食処理又は細化処理の後で適用する
こともできる。エッジピクセルの総面積が増加する効果を防止するために、膨張
処理には接続構成部分を検出する前に細化処理を後続させることができる。また
、式１の適用の結果としての二進化されたしきい処理された画像のグレイスケー
ル深度を増加し、次いで平滑化関数を適用し、再びしきい処理（式１）を実行す
ることもできる。所望の閉鎖効果を達成するために使用することができる多くの
画像処理技術が存在する。更に他の例は、図５Ｃに図示されたもののような連続
した系列の形のエッジピクセルにより略囲まれる場合に、これらピクセルをエッ
ジピクセルとして印すというものである。即ち、図示された２４の場合の各々は
、８ピクセルの隣接するものを伴うピクセルである。これらの場合の各々におい
て、隣接するものは連続した系列の形の５個以上のエッジピクセルを有している
。勿論、上記の連続した系列における数は変更することができるか、又は特別な
場合を上記群に追加することができる。更に、マトリクスのサイズを増加させる
こともできる。図５Ｃに対して規定されるもののようなアルゴリズムによりエッ
ジとして印されるのが好まれるピクセルの型式は、ピクセルが連続した断続部の
一部ではなさそうだと見なされるようなものである。同様な結果は、閉鎖処理（
浸食処理が後続する膨張処理）により、又は前記マスクにおける一層少ない鮮鋭
化処理を若しくは前記しきい処理（式１の適用）に関して前処理を使用すること
により得ることができる。Referring now to FIG. 4B, inadvertent gaps or breaks between CCs that correspond to contiguous characters can appear as a result of edge detection with a fixed threshold. For example, breaks such as those shown at 171 and 172 can occur. Use of the edge detection scheme described helps to ensure that such broken CC portions are merged. Starting from breaks such as those in the characters on the left-hand side of FIGS. 5A and 5B, the CC merging method results in the points in the breaks 174, 175 and 176 being identified as edge points and merged into single connected component structures, as at 181 and 182 respectively. Note that closing "bad" breaks in connected regions can be accomplished in various ways in addition to the particular method described above. For example, dilation can be applied after erosion or thinning. To avoid the effect of increasing the total area of the edge pixels, the dilation can be followed by thinning before the connected components are detected. Also, the gray-scale depth of the binarized, thresholded image resulting from the application of Formula 1 can be increased, a smoothing function then applied, and the thresholding (Formula 1) performed again. Many image processing techniques can be used to achieve the desired closing effect. Yet another alternative is to mark pixels as edge pixels when they are substantially surrounded by edge pixels in a contiguous series, as illustrated in FIG. 5C. That is, each of the 24 cases illustrated is a pixel with its eight neighboring pixels. In each of these cases, the neighbors include five or more edge pixels in a contiguous series. Of course, the number in the contiguous series can be changed, or special cases can be added to the group. In addition, the size of the matrices can be increased. The type of pixel favored for marking as an edge by an algorithm such as the one defined with respect to FIG. 5C is one that is deemed unlikely to be part of a continuous break. A similar result can be obtained by closing (dilation followed by erosion), by using less sharpening in the mask, or by preprocessing with respect to the thresholding (application of Formula 1).
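The neighbor-run rule of FIG. 5C can be approximated in Python as shown below. This is a sketch under assumptions: the eight neighbors are walked in circular order, and a non-edge pixel is marked as an edge when the longest contiguous run of edge neighbors is at least `min_run` (five in the text). Unlike the 24 illustrated cases, this simplified version does not check the remaining neighbors, and all names are hypothetical.

```python
# The eight neighbors in clockwise order, starting at the top-left.
RING = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def longest_circular_run(flags):
    """Length of the longest run of True values, allowing wrap-around."""
    if all(flags):
        return len(flags)
    best = run = 0
    for flag in flags + flags:  # doubling the list handles the wrap-around
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def close_gaps(edges, min_run=5):
    """Mark interior non-edge pixels that are mostly ringed by edge pixels."""
    height, width = len(edges), len(edges[0])
    result = [[bool(value) for value in row] for row in edges]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            if edges[y][x]:
                continue
            flags = [bool(edges[y + dy][x + dx]) for dy, dx in RING]
            if longest_circular_run(flags) >= min_run:
                result[y][x] = True
    return result


ringed = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]     # five contiguous edge neighbors
scattered = [[1, 0, 1], [0, 0, 0], [1, 0, 1]]  # four isolated edge neighbors
ringed_center = close_gaps(ringed)[1][1]
scattered_center = close_gaps(scattered)[1][1]
```

The first center pixel is filled because its left, top-left, top, top-right and right neighbors form an unbroken run of five; the second is left alone because none of its edge neighbors are adjacent to one another.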

【００４０】ＣＣとは、一方の部分から他方の部分を分割するような非エッジピクセルを伴
わない連続した系列を形成すると判定されたピクセルの集合である。各ＣＣにつ
いて、当該構造における最左端、最右端、最上端及び最下端のピクセルの座標を
、例えば当該構造の中心の座標のような当該構造の位置の指示子と共に含むリス
トが作成される。当該接続構成部分構造を形成するピクセルの数も記憶すること
ができる。ピクセルのカウントが特定の接続構成部分の面積を表すことに注意す
べきである。所定のシステム及び／又はユーザしきい値を、どの接続構成部分構
造を次の処理段階へ渡すべきかを判定するために当該接続構成部分構造の面積、
高さ及び幅に関する最大及び最小限界を規定するために使用することができる。
最後のステップは、ＣＣが文字としての資格を有するか否かを判定するフィルタ
である。他の発見的手法を、自身ではＣＣ発見的手法を満たすには小さ過ぎるＣ
Ｃを組み合わせるのに、又は大き過ぎるものを分割するために使用することがで
きる。A CC is a set of pixels determined to form a contiguous series, with no non-edge pixels dividing one portion from another. For each CC, a list is created containing the coordinates of the leftmost, rightmost, topmost and bottommost pixels of the structure, along with an indication of the location of the structure, for example the coordinates of its center. The number of pixels forming the connected component structure can also be stored. Note that the pixel count represents the area of the particular connected component. Predetermined system and/or user thresholds can be used to define maximum and minimum limits on the area, height and width of a connected component structure in order to determine which connected component structures are passed on to the next processing stage. The last step is a filter that determines whether a CC qualifies as a character. Other heuristics can be used to combine CCs that are too small to meet the CC heuristics on their own, or to split those that are too large.
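The per-CC record and the size limits can be sketched in Python as follows. Pixels are assumed to be `(x, y)` tuples with y increasing downward, and the dictionary keys, function names and limit values are hypothetical choices made for this example.

```python
def describe_component(pixels):
    """Bounding coordinates, center, and area (pixel count) of one CC."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)
    return {
        "left": left, "right": right, "top": top, "bottom": bottom,
        "center": ((left + right) / 2, (top + bottom) / 2),
        "width": right - left + 1,
        "height": bottom - top + 1,
        "area": len(pixels),
    }


def within_limits(info, limits):
    """Check each named measurement against its (minimum, maximum) pair."""
    return all(low <= info[name] <= high for name, (low, high) in limits.items())


pixels = [(3, 2), (4, 2), (3, 3), (3, 4)]
info = describe_component(pixels)
limits = {"area": (3, 100), "height": (2, 50), "width": (1, 50)}
accepted = within_limits(info, limits)
```

Components that fall outside any limit would be withheld from the next stage, or handed to the combining and splitting heuristics mentioned above.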

【００４１】ステップＳ２３０において、画像プロセッサ１２０は、先のステップにおいて
規準を満たした接続構成部分を、左下端のピクセルの位置に基づいて昇順に並び
替える。画像プロセッサ１２０はピクセルの座標に基づいて並び替えを行う。接
続構成部分の並び替えられたリストは、どのＣＣがテキストのブロック（“ボッ
クス”）を形成するかを判定するために調べられる。In step S230, the image processor 120 sorts the connected components that met the criteria in the previous steps in ascending order according to the position of their bottom-left pixels. The image processor 120 sorts on the basis of pixel coordinates. The sorted list of connected components is then traversed to determine which CCs form blocks ("boxes") of text.

【００４２】画像プロセッサ１２０は、最初のＣＣを第１ボックスに、及び解析のための最
初の又は現ボックスとして割り当てる。画像プロセッサ１２０は、後続の各ＣＣ
を検査して、その最下端のピクセルが第１ＣＣの対応するピクセルと同一の水平
ライン（又は直ぐ近くのライン）上に位置するかを見る。即ち、後続の各ＣＣは
、その垂直方向位置が現ＣＣのものに近い場合は現テキストボックスに追加され
る。もしそうなら、それは、同一のテキストのラインに属するものと見なされる
。垂直方向の座標の差のしきい値は、固定又は可変とすることができる。好まし
くは、第２ＣＣの水平方向座標の近さはＣＣの高さの関数とする。現テキストボ
ックスへの新たな追加の候補の水平方向距離も、それが許容可能な範囲内に入る
かを見るために判定される。The image processor 120 assigns the first CC to the first box, which also serves as the initial, or current, box for analysis. The image processor 120 tests each subsequent CC to see whether its bottommost pixel lies on the same horizontal line as (or a nearby line to) the corresponding pixel of the first CC. That is, each subsequent CC is added to the current text box if its vertical position is close to that of the current CC, in which case it is assumed to belong to the same line of text. The threshold on the difference in vertical coordinates can be fixed or variable. Preferably, the required closeness of the horizontal coordinate of the second CC is a function of the height of the CCs. The horizontal distance of a candidate for addition to the current text box is also tested to see whether it lies within an acceptable range.

【００４３】ＣＣが現テキストボックスに統合するための規準を満たしていない場合は、新
たなテキストボックスが発生され、外れたＣＣは、その最初の構成要素として印
される。この処理の結果、画像内の単一のテキストラインに対して複数のテキス
トボックスが得られる。系列内の次の接続構成部分が、かなり相違する垂直方向
座標又は最後のＣＣのものより小さい水平方向座標を有する場合は、現テキスト
ボックスは水平方向横断の端部で閉じられ、新たなテキストボックスが開始する
。If a CC does not meet the criteria for merging with the current text box, a new text box is generated with the failing CC marked as its first element. This process can result in multiple text boxes for a single line of text in the image. When the next connected component in the series has a substantially different vertical coordinate, or a horizontal coordinate lower than that of the last CC, the current text box is closed at the end of the horizontal traversal and a new text box is started.
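The first-level grouping of sorted CCs into text boxes can be sketched in Python as follows. The sketch assumes that the CCs arrive already sorted, that each CC is a dictionary of bounding coordinates, and that the allowed horizontal gap is `max_gap_factor` times the CC height; these names and the specific gap rule are hypothetical.

```python
def group_into_boxes(components, max_vertical_diff, max_gap_factor=1.0):
    """Append each CC to the current box if it continues the same text line."""
    boxes = []
    for cc in components:
        if boxes:
            last = boxes[-1][-1]
            height = cc["bottom"] - cc["top"] + 1
            same_line = abs(cc["bottom"] - last["bottom"]) <= max_vertical_diff
            moves_right = cc["left"] >= last["left"]
            close_enough = cc["left"] - last["right"] <= max_gap_factor * height
            if same_line and moves_right and close_enough:
                boxes[-1].append(cc)
                continue
        boxes.append([cc])  # the failing CC starts a new box
    return boxes


def make_cc(left, right, top, bottom):
    return {"left": left, "right": right, "top": top, "bottom": bottom}


ccs = [
    make_cc(0, 4, 0, 9),     # first character
    make_cc(7, 11, 0, 9),    # close neighbor on the same line
    make_cc(40, 44, 0, 9),   # same line, but too far to the right
    make_cc(0, 4, 20, 29),   # next line
]
boxes = group_into_boxes(ccs, max_vertical_diff=2)
box_sizes = [len(box) for box in boxes]
```

The third CC shares the baseline but lies too far away, so a single line of text ends up in two boxes here, which is the kind of split a second merging pass would be expected to repair.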

【００４４】各ボックスに関して、画像プロセッサ１２０は、次いで、最初の文字統合処理
により作成されたテキストボックスの各々に対して第２レベルの統合処理を実行
する。この処理は、誤って別のテキストのラインとして解釈され、従って別のボ
ックス内に配置されたテキストボックスを統合する。このことは、厳格な接続構
成部分統合規準から結果として生じるか、又は劣ったエッジ検出により結果とし
て同一の文字に対して複数のＣＣとなることにより生じ得る。For each box, the image processor 120 then performs a second level of merging on each of the text boxes created by the initial character merging process. This merges text boxes that were erroneously interpreted as separate lines of text and therefore placed in separate boxes. Such errors can result from strict connected component merging criteria, or from poor edge detection that yields multiple CCs for the same character.

【００４５】画像プロセッサ１２０は、一連の条件に関して各ボックスを、それに続くテキ
ストボックスと比較する。２つのテキストボックスに対する複数の判定条件は：ａ）一方のボックスの底部が他方の特定の垂直方向間隔内であり、該間隔は予測
されるライン間隔に対応する。また、２つのボックスの間の水平方向の間隔が、
第１ボックス内の文字の平均幅に基づく可変しきい値より小さい、ｂ）これらボックスの何れかの中心が他方のテキストボックスの領域内に位置す
る、又はｃ）第１ボックスの頂部が第２テキストボックスの底部と重なり合い、一方のボ
ックスの左辺又は右辺が、各々、他方の左辺又は右辺の数ピクセル以内である。The image processor 120 compares each box to the text boxes that follow it against a set of conditions. The test conditions for two text boxes are:
a) the bottom of one box is within a specified vertical spacing of the other, the spacing corresponding to an expected line spacing, and the horizontal spacing between the two boxes is less than a variable threshold based on the average width of the characters in the first box;
b) the center of either box lies within the area of the other text box; or
c) the top of the first box overlaps the bottom of the second text box, and the left or right side of one box is within a few pixels of the left or right side, respectively, of the other.

【００４６】上記条件の何れかが満たされると、画像プロセッサ１２０は前記テキストボッ
クスのリストから第２ボックスを削除し、該ボックスを第１ボックスに統合する
。画像プロセッサ１２０は、該処理を、全てのテキストボックスが互いに判定さ
れると共に可能な限りに合成されるまで繰り返す。If any of the above conditions is met, the image processor 120 deletes the second box from the list of text boxes and merges it into the first box. The image processor 120 repeats the process until all the text boxes have been tested against each other and combined as far as possible.
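The repeat-until-stable merging loop can be sketched in Python with a pluggable test. For brevity the sample predicate implements only condition b) (the center of either box lies inside the other); conditions a) and c) would be added in the same way. Boxes are assumed to be dictionaries of bounding coordinates, and every name in the sketch is hypothetical.

```python
def center(box):
    return ((box["left"] + box["right"]) / 2, (box["top"] + box["bottom"]) / 2)


def contains(box, point):
    x, y = point
    return box["left"] <= x <= box["right"] and box["top"] <= y <= box["bottom"]


def center_inside(first, second):
    """Condition b): the center of either box lies within the other box."""
    return contains(first, center(second)) or contains(second, center(first))


def union(a, b):
    return {
        "left": min(a["left"], b["left"]), "right": max(a["right"], b["right"]),
        "top": min(a["top"], b["top"]), "bottom": max(a["bottom"], b["bottom"]),
    }


def merge_boxes(boxes, should_merge):
    """Merge the second box of any qualifying pair into the first until stable."""
    boxes = [dict(box) for box in boxes]
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if should_merge(boxes[i], boxes[j]):
                    boxes[i] = union(boxes[i], boxes[j])
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes


big = {"left": 0, "right": 20, "top": 0, "bottom": 10}
inner = {"left": 8, "right": 14, "top": 2, "bottom": 8}
far = {"left": 50, "right": 60, "top": 0, "bottom": 10}
result = merge_boxes([big, inner, far], center_inside)
```

Restarting the scan after every merge is simple rather than fast; it guarantees that a box enlarged by one merge is tested again against all the others.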

【００４７】ステップＳ２３５において、画像プロセッサ１２０はステップ２３０から得ら
れたテキストボックスを、これらテキストボックスが面積、幅及び高さの特定の
拘束に従う場合は、テキストラインとして受け入れる。上記テキストボックスの
各々に関して、画像プロセッサ１２０は元の画像から当該テキストボックスに対
応する副画像を抽出する。次いで、画像プロセッサ１２０は文字認識の準備のた
めに該副画像を二値化する。即ち、カラー深度は２に低減され、しきい処理は文
字が背景から適切に区別されるのを保証するような値に設定される。これは、難
しい問題であり、複雑な背景を単純化するために複数のフレームを積分する等の
多数のステップを含み得る。In step S235, the image processor 120 accepts the text boxes obtained from step 230 as text lines if the text boxes comply with certain constraints of area, width and height. For each of the text boxes, image processor 120 extracts from the original image the sub-image corresponding to that text box. The image processor 120 then binarizes the sub-image in preparation for character recognition. That is, the color depth is reduced to 2 and the thresholding is set to a value that ensures that the characters are properly distinguished from the background. This is a difficult problem and may involve multiple steps such as integrating multiple frames to simplify complex backgrounds.

【００４８】画像を二値化するためのしきいは、以下のようにして決定することができる。
画像プロセッサ１２０は、テキストボックス内のピクセルの平均グレイスケール
値（AvgFG）を計算することにより、テキストボックス画像を修正する。これは
、該画像を二値化するためのしきいとして使用される。テキストボックスの周囲
の領域（例えば、５ピクセル）の平均グレイスケール値（AvgBG）も計算される
。副画像は、AvgFGより上の全てを白として印し、AvgFGより低い全てを黒として
印す。そして、白として印されたピクセルの平均、Avg1、が計算されると共に、
黒として印されたピクセルの平均、Avg2、も計算される。The threshold for binarizing the image can be determined as follows. The image processor 120 modifies the text box image by calculating the average gray-scale value of the pixels in the text box (AvgFG). This value is used as the threshold for binarizing the image. The average gray-scale value of a region around the text box (for example, five pixels) is also calculated (AvgBG). The sub-image is binarized by marking everything above AvgFG as white and everything below AvgFG as black. The average of the pixels marked as white, Avg1, is then calculated, along with the average of the pixels marked as black, Avg2.

【００４９】テキストボックスが一旦、黒及び白の（二値）画像に変換されると、画像プロ
セッサ１２０はAvg1及びAvg2をAvgGBと比較する。AvgBGに近い平均を有する領域
は背景として割り当てられ、他の領域は前景（又はテキスト）として割り当てら
れる。例えば、黒の領域の平均がAvgBGに近い場合は、該黒の領域は白に変換さ
れ、及びその逆となる。これは、テキストが、ＯＣＲプログラムへの入力に関し
て、常に一貫した値となることを保証する。次いで、画像プロセッサ１２０は、
抽出されたフレームテキストを画像テキストワークスペースに記憶し、当該処理
は処理ステップ２０５において次のフレームで以って継続する。局部しきい処理
に先立ち、テキスト解像度を向上するためにスーパー解像度ステップを実行する
ことができることに注意すべきである。Once the text box has been converted to a black-and-white (binary) image, the image processor 120 compares Avg1 and Avg2 to AvgBG. The region whose average is closer to AvgBG is assigned as the background, and the other region is assigned as the foreground (or text). For example, if the average of the black region is closer to AvgBG, the black region is converted to white and vice versa. This ensures that the text always has a consistent value for input to an OCR program. The image processor 120 then stores the extracted frame text in the image text workspace, and the process continues with the next frame at process step 205. Note that a super-resolution step can be performed to improve the text resolution prior to local thresholding.
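The binarization and polarity normalization can be sketched in Python as follows. The sketch assumes the text box is a nested list of gray values and the surrounding border region is a flat list; pixels exactly equal to AvgFG are treated as black, which the text leaves unspecified. The output convention (1 for background, 0 for text) and the function name are choices made for this example.

```python
def binarize_text_box(box_pixels, border_pixels):
    """Threshold at AvgFG, then label the class nearer AvgBG as background."""
    flat = [value for row in box_pixels for value in row]
    avg_fg = sum(flat) / len(flat)
    avg_bg = sum(border_pixels) / len(border_pixels)
    white = [value for value in flat if value > avg_fg]
    black = [value for value in flat if value <= avg_fg]
    avg1 = sum(white) / len(white) if white else avg_fg
    avg2 = sum(black) / len(black) if black else avg_fg
    black_is_background = abs(avg2 - avg_bg) < abs(avg1 - avg_bg)
    # Output convention: 1 = background, 0 = text, whatever the input polarity.
    return [
        [1 if (value <= avg_fg) == black_is_background else 0 for value in row]
        for row in box_pixels
    ]


light_on_dark = binarize_text_box([[20, 220, 20], [20, 220, 20]], [15, 25, 20, 20])
dark_on_light = binarize_text_box([[230, 30, 230]], [240, 235])
```

Both inputs yield text pixels with the same value even though one has light text on a dark background and the other the reverse, which is the consistency required for input to an OCR program.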

【００５０】次に、分類を実施することができる前に、個々の文字領域が分離されなければ
ならない。個々の文字領域をテキストのラインから分離するために、例えば、文
字の高さ対幅の比、高さ及び幅に対する上限及びしきい値等の種々の発見的手法
を使用することができる。これらの発見的手法は、通常、種々の寸法的特徴に関
する許容可能な値の予測の範疇に入る。Next, the individual character regions must be isolated before classification can be performed. Various heuristics can be used to isolate individual character regions from a line of text, for example the height-to-width ratio of the characters and ceilings and thresholds on height and width. These heuristics generally fall into the category of predictions about the permissible values of various dimensional features.

【００５１】接続構成部分は、元のテキストの明瞭さの不足のために、文字に対応するのを
失敗する可能性がある。ここで、図６Ａないし６Ｄを参照すると、ＣＣの区分け
が失敗した場合、水平方向のラインに沿う文字の区分けのために他のツールを使
用することができる。一例は垂直投影４２５であり、該投影は水平方向座標の関
数であると共に、その値はＸ座標に一致し且つ現テキストボックスに含まれる垂
直列内の前景ピクセルの数（及び、多分、図示のようにグレイスケール値）に比
例する。即ち、ピクセルが積分される垂直列はテキストボックスのサイズを超え
ないので、このようにして、現在の行の文字のみが測定される。この“グレイス
ケール”垂直投影４２５は窓関数４２５によって重み付けすることもでき、該窓
関数の幅は系列における次の文字に関する予測される幅に比例する。窓関数４２
５による重み付けの結果が、４２０に示されている。最小投影値を、文字の左及
び右縁を規定するために使用することもできる。A connected component may fail to correspond to a character because of a lack of clarity in the original text. Referring now to FIGS. 6A to 6D, if the CC partitioning fails, other tools can be used to partition the characters along a horizontal line. One example is the vertical projection 425, which is a function of the horizontal coordinate and whose value is proportional to the number of foreground pixels (and possibly, as illustrated, their gray-scale values) in the vertical column that coincides with that x-coordinate and lies within the current text box. That is, the vertical column over which the pixels are integrated does not extend beyond the text box, so that only the current row of characters is measured. This "gray-scale" vertical projection 425 can also be weighted by a window function 425 whose width is proportional to the expected width of the next character in the series. The result of weighting by the window function 425 is shown at 420. The minima of the projection can then be used to define the left and right edges of the characters.
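A plain (unweighted, binary) vertical projection and a split at its minima can be sketched in Python as follows. The sketch assumes foreground pixels are stored as 1 and omits the window function weighting; the function names and the `max_gap_value` parameter are hypothetical.

```python
def vertical_projection(binary_box):
    """Number of foreground pixels (value 1) in each column of the text box."""
    return [sum(column) for column in zip(*binary_box)]


def split_at_gaps(projection, max_gap_value=0):
    """(start, end) column ranges whose projection exceeds max_gap_value."""
    spans = []
    start = None
    for x, value in enumerate(projection):
        if value > max_gap_value and start is None:
            start = x
        elif value <= max_gap_value and start is not None:
            spans.append((start, x - 1))
            start = None
    if start is not None:
        spans.append((start, len(projection) - 1))
    return spans


box = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
]
projection = vertical_projection(box)
spans = split_at_gaps(projection)
```

The empty column at index 2 is a minimum of the projection and separates two character regions. Raising `max_gap_value` would also split at columns where characters touch only lightly.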

【００５２】図７Ａを参照すると、文字領域を分離する方法は、第１ＣＣで開始し、テキス
トボックスを介して順次進行する。ステップＳ３１０において開始し、第１の、
又は次のＣＣが選択される。ステップＳ３１２において、選択されたＣＣは、寸
法的な発見的手法に対して該ＣＣがそれらを満たすかを見るために判定される。
ＣＣに対する発見的手法の判定は、該ＣＣが完全な文字ではなさそうか、又は該
ＣＣが大き過ぎ、２以上の文字を含んでいそうであることを示すことができる。
ステップＳ３１４において該ＣＣが大き過ぎることが分かった場合、例えば上述
したグレイスケール投影のような文字を区分けする他の方法がステップＳ３１６
において適用される。ステップＳ３２２において該ＣＣが小さ過ぎることが分か
った場合は、ステップＳ３１８において次のＣＣが発見的手法に対して判定され
る。この結果が、ステップＳ３２０において、それに続くＣＣも小さ過ぎること
を示す場合、現及びそれに続くＣＣがステップＳ３２６において統合され、全て
の文字領域が分離されるまで、流れはステップＳ３１０に戻る。上記それに続く
ＣＣが小さ過ぎない場合は、現ＣＣはステップＳ３２４で破棄され、流れはステ
ップＳ３１０に戻る。Referring to FIG. 7A, a method for isolating character regions starts with the first CC and proceeds sequentially through the text box. Beginning at step S310, the first, or next, CC is selected. In step S312, the selected CC is tested against the dimensional heuristics to see whether it meets them. The heuristic tests may indicate that the CC is unlikely to be a complete character, or that it is too large and likely to contain more than one character. If the CC is found to be too large in step S314, an alternative method of partitioning the characters, for example the gray-scale projection described above, is applied in step S316. If the CC is found to be too small in step S322, the next CC is tested against the heuristics in step S318. If this shows, in step S320, that the following CC is also too small, the current and following CCs are merged in step S326, and the flow returns to step S310 until all the character regions have been isolated. If the following CC is not too small, the current CC is discarded in step S324 and the flow returns to step S310.
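The control flow of FIG. 7A can be sketched in Python as follows. To keep the example short, each CC is reduced to a `(left, right)` column range, the size test looks only at width, and an oversized CC is simply cut in half as a stand-in for the projection-based partitioning. The thresholds and all function names are hypothetical.

```python
def classify_size(cc, min_width=3, max_width=10):
    width = cc[1] - cc[0] + 1
    if width < min_width:
        return "too_small"
    if width > max_width:
        return "too_large"
    return "ok"


def split_in_half(cc):
    """Stand-in for the alternative partitioning method of step S316."""
    middle = (cc[0] + cc[1]) // 2
    return [(cc[0], middle), (middle + 1, cc[1])]


def isolate_characters(ccs):
    regions = []
    i = 0
    while i < len(ccs):                       # S310: select first or next CC
        size = classify_size(ccs[i])          # S312: apply the heuristics
        if size == "too_large":               # S314 -> S316
            regions.extend(split_in_half(ccs[i]))
            i += 1
        elif size == "too_small":             # S322 -> S318, S320
            if i + 1 < len(ccs) and classify_size(ccs[i + 1]) == "too_small":
                regions.append((ccs[i][0], ccs[i + 1][1]))  # S326: merge
                i += 2
            else:
                i += 1                        # S324: discard the current CC
        else:
            regions.append(ccs[i])
            i += 1
    return regions


ccs = [(0, 5), (7, 8), (9, 10), (12, 30), (32, 32), (34, 39)]
regions = isolate_characters(ccs)
```

The two narrow fragments are merged into one region, the wide CC is split, and the isolated sliver at column 32 is dropped because the CC that follows it is not too small.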

【００５３】図７Ｂを参照すると、文字を区分けする他の方法は、発見的手法に失敗した代
替文字領域を待避し、こえら代替物の分類を試みる。分類に際し、最高の信頼性
レベルを達成する代替物が選択される。この場合、他の文字領域はそれに従って
処理される。例えば、２つの統合されたＣＣに対応する画像が高い信頼性尺度で
分類された場合、第１ＣＣが統合されたＣＣに対応するフィールドは、最早、分
離した文字フィールドとしては扱われない。ステップＳ３３０において、第１の
、又は次のＣＣが選択される。ステップＳ３３２において、選択されたＣＣは、
寸法的な発見的手法に対して該ＣＣがそれらを満たすかを見るために判定される
。ステップＳ３３４において該ＣＣが大き過ぎることが分かった場合、文字を区
分けする他の方法がステップＳ３３６において適用される。ステップＳ３３８に
おいて該ＣＣが小さ過ぎることが分かった場合、現ＣＣ、及び次のＣＣと合成さ
れた現ＣＣが共に代替文字フィールドとして保持される。これら文字フィールド
が以下に説明するように分類のために送出されると、これら代替物の間での選択
を行うために信頼性尺度が使用される。流れは、全ての文字領域が分離されるま
で、ステップＳ３３０に戻る。ステップＳ３３６の断続処理が低信頼性尺度を生
じた場合は、大き過ぎるサイズで且つ割られたフィールドは分類に使用する代替
物として保持され、分類結果はこれら代替物の間の選択に使用される。Referring to FIG. 7B, an alternative method of partitioning the characters saves alternative character regions that fail the heuristics and attempts to classify those alternatives. At classification, the alternative that achieves the highest confidence level is selected, and the other character regions are then treated accordingly. For example, if the image corresponding to two merged CCs is classified with a high confidence measure, the field corresponding to the CC with which the first CC was merged is no longer treated as a separate character field. In step S330, the first, or next, CC is selected. In step S332, the selected CC is tested against the dimensional heuristics to see whether it meets them. If the CC is found to be too large in step S334, an alternative method of partitioning the characters is applied in step S336. If the CC is found to be too small in step S338, the current CC and the current CC combined with the next CC are both retained as alternative character fields. When these character fields are submitted for classification as described below, a confidence measure is used to choose between the alternatives. The flow returns to step S330 until all the character regions have been isolated. If the partitioning operation of step S336 produces a low confidence measure, the oversized field and the split fields are retained as alternatives for use in classification, and the classification results are used to choose between them.

【００５４】文字に対応する領域は直線的ボックスとして規定される必要はない。これら領
域は、輪ゴム式の境界領域（任意の数の辺を持つ凸状多角形）、又は直交的に凸
状の直線的多角形（内部の２点を結ぶ全ての水平及び垂直方向線分が完全に内部
に位置するような直線的多角形）、又は予測される記号若しくは文字の重要な特
徴を略内包する何れかの他の好適な形状であり得る。The regions corresponding to characters need not be defined as rectilinear boxes. They could be rubber-band type bounding regions (convex polygons with an arbitrary number of sides), orthogonally convex rectilinear polygons (rectilinear polygons in which every horizontal or vertical segment connecting two interior points lies entirely inside), or any other suitable shape that substantially encloses the significant features of the expected symbols or characters.

[0055] Note that the text box information can be omitted altogether, and the connected components can be used directly to identify candidate character regions. In that case, however, a larger number of connected components can be expected to fall outside the particular symbol set into which they are to be mapped (classified). Note also that, as is clear from the above description, the techniques described apply broadly to the classification of symbols and are not limited to the classification of textual characters.

[0056] Referring to FIG. 8, once all the character regions have been isolated (subsumed by step S405), the characters can be classified in sequence. Next, in step S410, the first or next character region is selected. In step S415, the corresponding part of the original image (or its red portion) is subjected to some appropriate image analysis in preparation for feature analysis. For example, the image may be binarized (thresholded), converted to grayscale, binarized and thinned, and so on. The preprocessing varies according to the feature space used.
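The simplest of the preprocessing options in step S415, extracting the red portion and thresholding it, might look like the sketch below. The image layout (rows of `(R, G, B)` tuples), the threshold of 128, and the assumption of bright text on a darker background are all illustrative choices, not values taken from the patent.

```python
def red_channel(rgb):
    """Keep only the red component of an image given as rows of (R, G, B) tuples."""
    return [[px[0] for px in row] for row in rgb]

def binarize(gray, threshold=128):
    """Threshold a grayscale image (rows of 0-255 ints): 1 = foreground, 0 = background.

    Assumes bright text on a darker background; invert the comparison otherwise.
    """
    return [[1 if px >= threshold else 0 for px in row] for row in gray]
```

A thinning pass, where the chosen feature space needs one, would follow the binarization.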

[0057] Referring to FIGS. 9A to 9D, for example, a feature space may make use of certain feature points (as described below). The feature points can be identified using skeleton characters; to derive these from ordinary video characters (FIG. 9A), the image may be binarized (FIG. 9B) and then thinned (FIG. 9C). The feature points (465 to 468 in FIG. 9D) can then be derived as the corner points 465, bends 466, crossing points 467, and end points 468 of the thinned characters 460, 470. This kind of image processing is well suited to the angle histogram feature space described below. Only a lesser degree of image processing is required to compute the size-invariant moments. Note that other feature point definition systems can be used as well.
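One common way to find end points and crossing points on a one-pixel-wide skeleton is the crossing-number test, sketched below. The patent does not say which detector it uses, so this technique is an assumption, as are the function names; corner points and bends would need an additional curvature test that is not shown.

```python
# Offsets of the 8 neighbours in circular order: N, NE, E, SE, S, SW, W, NW.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def crossing_number(skel, r, c):
    """Count 0 -> 1 transitions while walking once around pixel (r, c)."""
    h, w = len(skel), len(skel[0])
    ring = [skel[r + dr][c + dc] if 0 <= r + dr < h and 0 <= c + dc < w else 0
            for dr, dc in RING]
    return sum(1 for k in range(8) if ring[k] == 0 and ring[(k + 1) % 8] == 1)

def feature_points(skel):
    """Return (end_points, crossing_points) of a binary skeleton image."""
    ends, crossings = [], []
    for r, row in enumerate(skel):
        for c, px in enumerate(row):
            if px:
                cn = crossing_number(skel, r, c)
                if cn == 1:        # a single branch leaves the pixel
                    ends.append((r, c))
                elif cn >= 3:      # three or more branches meet here
                    crossings.append((r, c))
    return ends, crossings
```

For a T-shaped skeleton, this reports the three stroke tips as end points and the junction of the bar and the stem as the only crossing point.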

[0058] Referring again to FIG. 8, the original character may be subjected to a variety of different analyses to define a feature vector, which can be applied to the input of a suitably trained back-propagation neural network (BPNN). For the techniques that use size-invariant moments, either unthinned or thinned characters may be used. In step S420, the selected feature vector is generated by the appropriate image analysis. A variety of these can be used. A number of different feature spaces have been defined for the application to which this patent relates. The defined feature spaces, described in detail below, are size and rotation invariant and are considered particularly suitable for video character classification using a BPNN classifier.

[0059] A first feature space is derived from the feature points of the thinned character, as illustrated by FIGS. 9A to 9D. Referring to FIGS. 10A and 10B, first, a Delaunay triangulation (FIG. 10A) or a Voronoi diagram (FIG. 10B) is derived from the feature points 12. The image processor 120 performs the triangulation and then, for each triangle 1 to 6, generates an inventory of the interior angles. The image processor then uses this inventory to generate a histogram of the angles, as illustrated in FIG. 11A. The histogram simply represents the frequency of angles A, B, and C falling within given size ranges in the set of triangles 1 to 6 defined by the triangulation. Note that other triangulation methods or polygon-generating methods can also be used. For example, referring to FIG. 10B, the set of Voronoi polygons 17 and 18 can be used to define a set of angles A′, B′, and C′, each associated with a vertex 14 of the Voronoi diagram. The resulting angle histogram serves as the feature vector for the particular character from which the feature points were derived.
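The angle inventory and histogram can be sketched as follows. The sketch assumes the triangulation has already been computed elsewhere and is supplied as index triples into the point list (the form produced, for instance, by the `simplices` attribute of SciPy's `scipy.spatial.Delaunay`); the bin count of 9 and the function names are illustrative assumptions.

```python
import math

def interior_angles(a, b, c):
    """Interior angles (degrees) at vertices a, b, c, via the law of cosines."""
    la, lb, lc = math.dist(b, c), math.dist(a, c), math.dist(a, b)

    def angle(opp, s1, s2):
        cos_v = (s1 * s1 + s2 * s2 - opp * opp) / (2 * s1 * s2)
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_v))))

    return [angle(la, lb, lc), angle(lb, la, lc), angle(lc, la, lb)]

def angle_histogram(points, triangles, bins=9):
    """Histogram of all interior angles; each bin spans 180/bins degrees."""
    hist = [0] * bins
    width = 180.0 / bins
    for i, j, k in triangles:
        for ang in interior_angles(points[i], points[j], points[k]):
            hist[min(int(ang // width), bins - 1)] += 1
    return hist
```

Because only angles enter the histogram, translating, scaling, or rotating the feature points leaves the feature vector unchanged, which is the invariance property the description relies on.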

[0060] Other size- and rotation-invariant features can be added to the above feature space, for example the number of horizontal lines, the number of crossing points, the number of end points, holes, inflection points, midpoints, and so on. Another variation on the angle histogram is to use only the two largest (or smallest) interior angles of each triangle. Yet another variation is to use a two-dimensional angle histogram in place of the one-dimensional angle histogram. For example, referring to FIG. 11B, the pair of largest (or smallest) angles for each triangle defines an ordered pair (ordered by size) for each triangle in the Delaunay triangulation (or for each vertex of the Voronoi diagram). The first element of each ordered pair is used for the first dimension of the matrix, and the second element for the second dimension. In this way, the association between the angles is preserved as information for training and classification using the BPNN classifier.
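The two-dimensional variant can be sketched as below. The input is assumed to be a list of interior-angle triples in degrees, one per triangle; the bin count, the descending ordering of each pair, and the function name are illustrative assumptions rather than details given in the patent.

```python
def angle_pair_histogram(triangle_angles, bins=6, use_largest=True):
    """2-D histogram of the two largest (or two smallest) angles of each triangle.

    triangle_angles: iterable of (A, B, C) interior angles in degrees.
    The row index is the bin of the first (larger) angle of the ordered pair,
    and the column index is the bin of the second.
    """
    width = 180.0 / bins
    hist = [[0] * bins for _ in range(bins)]
    for angles in triangle_angles:
        ordered = sorted(angles, reverse=True)        # order by size
        first, second = ordered[:2] if use_largest else ordered[1:]
        row = min(int(first // width), bins - 1)
        col = min(int(second // width), bins - 1)
        hist[row][col] += 1
    return hist
```

Flattening the resulting matrix row by row gives a fixed-length vector that could be fed to the classifier in place of the one-dimensional histogram.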

[0061] Yet another feature space considered particularly suitable for a video character BPNN classifier is an array of size-invariant moments. These moments are defined by the equations below. Although a large number of distinct moments could be used in this setting, a particular few are selected for this application. First, the pixel indices of the pixel location coinciding with the center of mass, $\bar{i}$ and $\bar{j}$, are given by

[Formula 2]
$$\bar{i} = \frac{1}{A}\sum_{i}\sum_{j} i\,B[i][j], \qquad \bar{j} = \frac{1}{A}\sum_{i}\sum_{j} j\,B[i][j]$$

where B[i][j] is 1 if the (i, j)-th pixel of the thresholded image is a foreground pixel and 0 otherwise, and A is the total area of the foreground pixels, given by

[Formula 3]
$$A = \sum_{i}\sum_{j} B[i][j]$$

The translation-invariant moments are given by

[Formula 4]

where $M_{p,q}$ is the (p, q)-th raw moment of the character image, given by

[Formula 5]

The invariant moments selected for input to the BPNN are

[Formula 6]
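The moment computations of this paragraph can be sketched in code. Because Formulas 4 to 6 are not reproduced in this text, the sketch falls back on standard textbook definitions: moments taken about the center of mass (which makes them translation invariant) and a size normalization by A^(1 + (p + q)/2). Whether these coincide with the patent's own formulas is an assumption, and the function names are hypothetical.

```python
def centroid_and_area(B):
    """Area A and centre of mass (i_bar, j_bar) of a binary image B[i][j]."""
    A = sum(sum(row) for row in B)
    i_bar = sum(i * px for i, row in enumerate(B) for px in row) / A
    j_bar = sum(j * px for row in B for j, px in enumerate(row)) / A
    return A, i_bar, j_bar

def central_moment(B, p, q):
    """Moment of order (p, q) about the centre of mass (translation invariant)."""
    _, i_bar, j_bar = centroid_and_area(B)
    return sum((i - i_bar) ** p * (j - j_bar) ** q * px
               for i, row in enumerate(B) for j, px in enumerate(row))

def normalized_moment(B, p, q):
    """Central moment divided by A**(1 + (p + q) / 2), a common size normalisation."""
    A, _, _ = centroid_and_area(B)
    return central_moment(B, p, q) / A ** (1 + (p + q) / 2)
```

Shifting the character inside the image changes the center of mass but not the central moments, so the same feature values result wherever the character sits in its box.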

[0062] Referring again to FIG. 8, in step S425 each feature vector is applied to the trained BPNN, which outputs various candidate classes and, ideally, depending on the input, one very strong candidate. If there are multiple candidate characters, a best guess is made in step S430 by combining the probabilities output by the BPNN with usage frequency data for the presumed language and context. Such data can be compiled from different types of material, for example transcripts of television advertisements, printed material, and files streamed or downloaded from the Internet. One way to combine them is to weight the probabilities output by the BPNN by the corresponding probabilities associated with the usage frequency statistics.
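The weighting described for step S430 can be sketched as a product of the two probabilities followed by renormalization. The dictionary layout, the small floor used for symbols missing from the usage statistics, and the function name are assumptions made for illustration.

```python
def best_guess(bpnn_probs, usage_freq):
    """Weight classifier outputs by usage-frequency probabilities and renormalise.

    bpnn_probs: dict symbol -> probability output by the classifier.
    usage_freq: dict symbol -> relative frequency for the presumed language/context.
    Returns (best_symbol, normalised_scores).
    """
    floor = 1e-6  # assumed small weight for symbols absent from the statistics
    weighted = {s: p * usage_freq.get(s, floor) for s, p in bpnn_probs.items()}
    total = sum(weighted.values())
    if total == 0:
        # Nothing survives the weighting: fall back on the classifier alone.
        return max(bpnn_probs, key=bpnn_probs.get), dict(bpnn_probs)
    scores = {s: w / total for s, w in weighted.items()}
    return max(scores, key=scores.get), scores
```

With classifier outputs of 0.6 for "1" and 0.4 for "l", and usage frequencies of 0.01 and 0.04 respectively, the weighted scores favor "l", showing how context statistics can overturn a weak classifier preference.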

[0063] It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the invention may be embodied in other specific forms without departing from its spirit or essential attributes. For example, the text analysis presented above was described with a preference for horizontally aligned text. It is clear that similar methods can be applied to other alignments, such as vertically aligned text, text along a curve, and so on.

[0064] The embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

[Brief description of drawings]

FIG. 1 is a block diagram showing an apparatus that can be used to implement the present invention.

FIG. 2 is a flowchart showing a character classification method according to an embodiment of the present invention.

FIG. 3A shows a text region in a video screen containing information that can be classified according to an embodiment of the present invention.

FIG. 3B also shows a text region in a video screen containing information that can be classified according to an embodiment of the present invention.

FIG. 4A shows the appearance of a text segment from a captured digital image of a video frame.

FIG. 4B shows the text segment after edge detection filtering.

FIG. 4C shows the effect of several stages of filtering during or before edge detection; note that these are presented to illustrate concepts relevant to the invention and do not depict actual intermediate results.

FIG. 5A shows the effect of edge filtering according to an embodiment of the present invention.

FIG. 5B also shows the effect of edge filtering according to an embodiment of the present invention.

FIG. 5C shows an example of a gap-closing algorithm that can be used in the present invention.

FIG. 6A illustrates a technique for segmenting text lines according to an embodiment of the present invention.

FIG. 6B also illustrates a technique for segmenting text lines according to an embodiment of the present invention.

FIG. 6C also illustrates a technique for segmenting text lines according to an embodiment of the present invention.

FIG. 6D also illustrates a technique for segmenting text lines according to an embodiment of the present invention.

FIG. 7A is a flowchart showing a technique for creating and managing connected components by filtering according to an embodiment of the present invention.

FIG. 7B is also a flowchart showing a technique for creating and managing connected components by filtering according to an embodiment of the present invention.

FIG. 8 is a flowchart showing a character classification method according to an embodiment of the present invention.

FIG. 9A shows the filtering of a segmented character to derive a feature vector precursor.

FIG. 9B also shows the filtering of a segmented character to derive a feature vector precursor.

FIG. 9C also shows the filtering of a segmented character to derive a feature vector precursor.

FIG. 9D also shows the filtering of a segmented character to derive a feature vector precursor.

FIG. 10A shows the Delaunay triangulation stage of the image filtering step of a character classification process according to an embodiment of the present invention.

FIG. 10B shows the Voronoi diagram stage of the image filtering step of a character classification process according to an embodiment of the present invention.

FIG. 11A shows an angle histogram type feature space according to an embodiment of the present invention.

FIG. 11B also shows an angle histogram type feature space according to an embodiment of the present invention.

[Explanation of symbols]

100 ... Image text analysis system
110 ... Video processing device
120 ... Image processor
130 ... RAM
132 ... Image text workspace
134 ... Text analysis controller
140 ... Storage unit
150 ... User I/O card
160 ... Video card
170 ... I/O buffer
175 ... Processor bus
180 ... Video source
185 ... Monitor

Continuation of front page: (72) Inventor: Nevenka Dimitrova, Prof. Holstlaan 6, 5656 AA Eindhoven, Netherlands. F-term (reference): 5B029 AA02 BB02 CC28 DD05 EE08

Claims

1. A method for classifying symbols in an image data stream containing symbols, the method comprising: training a back-propagation neural network (BPNN) on a feature space including at least two shape-dependent features; capturing an image from the data stream; detecting an image region coextensive with a symbol to be classified embedded in the video data stream; deriving a feature vector from the image based on the feature space; and applying the feature vector to the BPNN to classify the symbol.

2. The method of claim 1, wherein the at least two shape-dependent features include size-, translation-, and rotation-invariant moments.

3. The method of claim 1, further comprising identifying feature points in the image, wherein the at least two shape-dependent features include a measure of the incidence of angles appearing in a triangulation of the feature points.

4. The method of claim 1, further comprising identifying feature points in the image and forming at least one of a Delaunay triangulation and a Voronoi diagram based on the feature points, wherein the at least two shape-dependent features include a histogram representing the incidence of angles appearing in at least one of the Delaunay triangulation and the Voronoi diagram.

5. An apparatus for classifying symbols in an image data stream containing symbols, comprising: an image data storage unit having an output and an input connected to capture data from the image data stream; and an image processor connected to the output of the image data storage unit and programmed to detect an image region coextensive with a symbol to be classified embedded in the image data stream; wherein the image processor includes a back-propagation neural network (BPNN) trained on a feature space, the feature space including at least one shape-dependent feature; and wherein the image processor is programmed to derive a feature vector from the image based on the feature space and to apply the feature vector to the BPNN to classify the symbol.

6. The apparatus of claim 5, wherein the image processor is further programmed to identify feature points in the image and to form at least one of a Delaunay triangulation and a Voronoi diagram based on the feature points, the derivation of the feature points including thinning a binarized version of the image, and wherein at least two shape-dependent features include a histogram representing the incidence of angles appearing in at least one of the Delaunay triangulation and the Voronoi diagram.

7. A computer program product for an image data processing apparatus, comprising a set of instructions which, when loaded into the image data processing apparatus, causes the image data processing apparatus to perform the method of claim 1.