EP1917626A1 - Method for retrieving text blocks in documents

Method for retrieving text blocks in documents

Info

Publication number
EP1917626A1
EP1917626A1 EP06776758A EP06776758A EP1917626A1 EP 1917626 A1 EP1917626 A1 EP 1917626A1 EP 06776758 A EP06776758 A EP 06776758A EP 06776758 A EP06776758 A EP 06776758A EP 1917626 A1 EP1917626 A1 EP 1917626A1
Authority
EP
European Patent Office
Prior art keywords
text block
text
features
feature data
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP06776758A
Other languages
German (de)
English (en)
Inventor
Katja Worm
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG
Publication of EP1917626A1
Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G06V30/147 Determination of region of interest
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • G06V30/424 Postal images, e.g. labels or addresses on parcels or postal envelopes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • the invention relates to a method for retrieving text blocks in documents according to the preamble of claim 1.
  • the types of postal items to be sorted are of particular importance.
  • the former can be easily distinguished by known methods since they are e.g. strongly differentiated by their color.
  • mass mailings of one type have e.g. the same color. They usually share the same elements, such as symbols, logos and frankings, and differ only in the area of the recipient address. This makes it necessary to use address features, e.g. by carrying out a complex word recognition.
  • the method should be optimally suited for sorting postal bulk mail.
  • a text block offers a great deal of potential for description by means of suitable features and thus to generate an associated feature data record which unambiguously characterizes it and differentiates it from other text blocks.
  • no content-related interpretation of the text block and thus no comparison should be made on the basis of the literal text content.
  • Features that relate to graphic properties of the entire text block are much easier and faster to recognize than features obtained by interpreting the text. Examples are the size of the text block, the location of the text block within the printed product, the degree of filling of the text block, the number of lines in the text block, the size of the gaps between lines in the text block and/or the font size of the lines in the text block.
  • one or more fine structure related features of the text block may be extracted, which now refer to graphic properties of individual lines in the text block.
  • the features used here can be selected from the following: number of context areas within a line, frequency of connected areas, color value transitions in a line (and possibly their matrix form for several lines) and/or line profiles.
  • feature vectors are used as feature data records; they are used for sorting and comparing, e.g. when two text blocks are matched in the identification process.
  • the structure-related features of a text block of a printed product are arranged in a feature data record such that a comparison between two features of the same category remains feasible.
  • the feature data sets are compared with one another according to their assignment in order to identify the text blocks as a function of the coarse or possibly the fine classification.
  • FIG. 3B shows the detection of the same address field on a new mail item with a missing line
  • FIG. 3C shows a reassignment of lines.
  • FIG. 1 shows what is meant, in the feature data record, by a line and a line spacing, using the decomposition of an address field in full extent (above) into three lines 1, 2, 3 (below).
  • the font size (e.g., the largest letter of the line)
  • with these features, a rough analysis or classification of the similarity of two texts can be carried out. The features are easy, fast and reliable to detect and have negligible memory requirements.
  • Text blocks that are similar according to these criteria are then examined by more complex procedures. For this purpose, the structure of the text on the one hand and the individual text lines on the other are examined in more detail. Using the detected lines, the following features can be determined for a second, finer classification:
  • the first features described can be evaluated using simple distance measures and decision procedures. Line profiles, however, require a more complex distance measure, since the vectors depend heavily on the detected text block. Slight shifts lead to changes in the feature data record. To determine the distance, a distance measure is therefore needed that takes the influence of such shifts into account.
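
The two-stage comparison described above (a cheap coarse test first, then a shift-tolerant comparison of line profiles) can be sketched as follows. This is only an illustration under stated assumptions: the text does not name a concrete distance measure, so the relative-difference test, the mean absolute difference minimised over small shifts, the function names and the thresholds `coarse_tol`, `fine_tol` and `max_shift` are all hypothetical choices, not the method of the application.

```python
from typing import Sequence


def coarse_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest relative difference between coarse features of the same category.

    Assumes both sequences are non-empty and list the same features in the
    same order (e.g. block width, block height, number of lines).
    """
    return max(abs(x - y) / max(abs(x), abs(y), 1e-9) for x, y in zip(a, b))


def shift_tolerant_distance(p: Sequence[float], q: Sequence[float],
                            max_shift: int = 3) -> float:
    """Smallest mean absolute difference between two profiles over small shifts.

    max_shift should stay small relative to the profile length; otherwise a
    very short overlap could produce a misleadingly small distance.
    """
    best = float("inf")
    for s in range(-max_shift, max_shift + 1):
        # Pair p[i] with q[i + s] wherever both indices are valid.
        pairs = [(p[i], q[i + s]) for i in range(len(p)) if 0 <= i + s < len(q)]
        if pairs:
            best = min(best, sum(abs(x - y) for x, y in pairs) / len(pairs))
    return best


def blocks_match(coarse_a: Sequence[float], coarse_b: Sequence[float],
                 profile_a: Sequence[float], profile_b: Sequence[float],
                 coarse_tol: float = 0.1, fine_tol: float = 0.5) -> bool:
    """Run the cheap coarse test first; compare profiles only if it passes."""
    if coarse_distance(coarse_a, coarse_b) > coarse_tol:
        return False
    return shift_tolerant_distance(profile_a, profile_b) <= fine_tol
```

For example, a profile `[0, 0, 5, 9, 5, 0, 0, 0]` and the same profile shifted by one position have distance 0 under `shift_tolerant_distance`, whereas a plain position-by-position comparison (`max_shift=0`) would report a clear difference.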

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Character Discrimination (AREA)
  • Sorting Of Articles (AREA)

Abstract

The invention relates to a method for retrieving text blocks in documents, preferably for postal items to be sorted, such as bulk mail. The aim of the invention is to retrieve or identify these text blocks in documents of any type using feature data records of reference text blocks. To this end, structure-related features of the text block are extracted and compared with the features of a feature data record of a reference text block, so that similar features can be recognized as simply as possible among several text blocks. A first extraction of structure-related features can be carried out, e.g., by decomposing a text block into several lines, whose heights or line spacings are stored in a feature data record of a mail item. In this way, similarities among different text blocks can be found by comparing the feature data records.
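
The line decomposition mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming the text block is already available as a binary image (1 = ink, 0 = background) and that a line is simply a run of rows containing ink; the names `LineFeatures`, `decompose_into_lines` and `same_layout` and the tolerance `tol` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LineFeatures:
    """Hypothetical feature data record: line heights and gaps, in pixel rows."""
    line_heights: List[int]
    line_gaps: List[int]


def decompose_into_lines(block: Sequence[Sequence[int]]) -> LineFeatures:
    """Split a binary text block (1 = ink, 0 = background) into lines.

    A row belongs to a line if it contains at least one ink pixel; runs of
    empty rows between two lines are recorded as line gaps. Empty rows above
    the first line and below the last line are ignored.
    """
    has_ink = [any(row) for row in block]
    heights: List[int] = []
    gaps: List[int] = []
    run = 0           # length of the current run of equal-valued rows
    in_line = False   # whether the current run is a text line
    for ink in has_ink:
        if ink == in_line:
            run += 1
            continue
        # The run ended: store it as a line, or as a gap between two lines.
        if in_line:
            heights.append(run)
        elif heights:
            gaps.append(run)
        in_line, run = ink, 1
    if in_line:
        heights.append(run)
    return LineFeatures(heights, gaps)


def same_layout(a: LineFeatures, b: LineFeatures, tol: int = 1) -> bool:
    """Compare two records category by category, allowing `tol` rows of jitter."""
    if len(a.line_heights) != len(b.line_heights):
        return False
    pairs = list(zip(a.line_heights, b.line_heights))
    pairs += list(zip(a.line_gaps, b.line_gaps))
    return all(abs(x - y) <= tol for x, y in pairs)
```

Because heights are only compared with heights and gaps with gaps, the record keeps features of the same category comparable, and no character recognition is needed for the comparison.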
EP06776758A 2005-08-26 2006-08-11 Procede pour retrouver des blocs de texte dans des documents Withdrawn EP1917626A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102005040687A DE102005040687A1 (de) 2005-08-26 2005-08-26 Verfahren zum Wiederauffinden von Textblöcken in Dokumenten
PCT/EP2006/007939 WO2007022877A1 (fr) 2005-08-26 2006-08-11 Procede pour retrouver des blocs de texte dans des documents

Publications (1)

Publication Number Publication Date
EP1917626A1 (fr) 2008-05-07

Family

ID=37398939

Family Applications (1)

Application Number Title Priority Date Filing Date
EP06776758A Withdrawn EP1917626A1 (fr) 2005-08-26 2006-08-11 Procede pour retrouver des blocs de texte dans des documents

Country Status (6)

Country Link
US (1) US20090252415A1 (fr)
EP (1) EP1917626A1 (fr)
CN (1) CN101263512A (fr)
CA (1) CA2620180A1 (fr)
DE (1) DE102005040687A1 (fr)
WO (1) WO2007022877A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102331982B (zh) * 2011-07-28 2014-03-05 深圳万兴信息科技股份有限公司 自适应窗体大小的pdf文档显示方法、系统及移动终端
US10095946B2 (en) * 2016-07-07 2018-10-09 Lockheed Martin Corporation Systems and methods for strike through detection

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US5848184A (en) * 1993-03-15 1998-12-08 Unisys Corporation Document page analyzer and method
EP0702322B1 (fr) * 1994-09-12 2002-02-13 Adobe Systems Inc. Méthode et appareil pour identifier des mots décrits dans un document électronique portable
US5995659A (en) 1997-09-09 1999-11-30 Siemens Corporate Research, Inc. Method of searching and extracting text information from drawings
JPH11238097A (ja) * 1998-02-20 1999-08-31 Toshiba Corp 郵便物宛先読取装置及び宛先読取方法
JP2004326491A (ja) * 2003-04-25 2004-11-18 Canon Inc 画像処理方法
JP4855698B2 (ja) * 2005-03-22 2012-01-18 株式会社東芝 宛先認識装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2007022877A1 *

Also Published As

Publication number Publication date
CN101263512A (zh) 2008-09-10
CA2620180A1 (fr) 2007-03-01
DE102005040687A1 (de) 2007-03-01
WO2007022877A1 (fr) 2007-03-01
US20090252415A1 (en) 2009-10-08

Similar Documents

Publication Publication Date Title
EP1665132B1 (fr) Procede et systeme de detection de donnees provenant de plusieurs documents lisibles par ordinateur
DE60308025T3 (de) Identifikationsmarkieren von poststücken durch bildsignatur und zugehörige postbearbeitungsmaschine
DE69814104T2 (de) Aufteilung von texten und identifizierung von themen
DE10027178B4 (de) Magnetstreifen-Authentifizierungs-Verifizierungssystem
EP0980293B1 (fr) Procede et dispositif de reconnaissance d'information de distribution de courrier
DE19705757A1 (de) Verfahren und Gerät für das Design eines hoch-zuverlässigen Mustererkennungs-Systems
DE19511470C1 (de) Verfahren zur Ermittlung eines Referenzschriftzuges anhand einer Menge von schreiberidentischen Musterschriftzügen
DE69926280T2 (de) Verfahren zur Erkennung von Adressen und Briefverarbeitungsvorrichtung
WO1999016559A1 (fr) Procede et dispositif de reconnaissance des donnees de distribution figurant sur les envois postaux
DE2435889B2 (de) Verfahren und einrichtung zur unterscheidung von zeichengruppen
WO2007022880A1 (fr) Procede d'identification d'envois a trier
EP1917626A1 (fr) Procede pour retrouver des blocs de texte dans des documents
EP2259210A2 (fr) Procédé et dispositif destinés à l'analyse d'une base de données
DE102006008936A1 (de) Verfahren zum Erkennen von Objekten und Objekterkennungssystem
EP2273383A1 (fr) Procédé et dispositif de recherche automatique de documents dans un dispositif de stockage de données
EP2084652A1 (fr) Procédé et dispositif d'identification d'objets
EP1389493A1 (fr) Procédé et dispositif pour le marquage automatique d'un champ d'adresse
DE3414455A1 (de) Verfahren und vorrichtung zum lesen und speichern von information
EP1159705B1 (fr) Procede de lecture d'entrees de documents et d'adresses
DE19820353C2 (de) Verfahren und Vorrichtung zur Erkennung eines Musters auf einer Vorlage
EP0731955B1 (fr) Procede et dispositif de saisie et d'identification automatique d'informations enregistrees
WO2003079273A2 (fr) Procede et dispositif de lecture d'adresses d'envois
DE102009050681A1 (de) Verfahren und Vorrichtung zum Erkennen und Klassifizieren von Dokumentteilen eines rechnerverfügbaren Dokuments durch schrittweises Lernen aus mehreren Trainingsmengen
DE19635351C2 (de) Verfahren zur Formatkonvertierung
DE102009013390A1 (de) Verfahren und Vorrichtung zum Klassifizieren eines physikalischen Objekts mittels eines parametrierten Klassifikators

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20080226

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR

17Q First examination report despatched

Effective date: 20081001

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20090415