EP1917626A1 - Method for retrieving text blocks in documents - Google Patents
Method for retrieving text blocks in documents
Info
- Publication number
- EP1917626A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- text block
- text
- features
- feature data
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/146—Aligning or centring of the image pick-up or image-field
- G06V30/147—Determination of region of interest
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/42—Document-oriented image-based pattern recognition based on the type of document
- G06V30/424—Postal images, e.g. labels or addresses on parcels or postal envelopes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Definitions
- the invention relates to a method for retrieving text blocks in documents according to the preamble of claim 1.
- the types of postal items to be sorted are of particular importance.
- the former can be easily distinguished by known methods since they are e.g. strongly differentiated by their color.
- mass mailings of one type have, e.g., the same color. They usually carry the same elements, such as symbols, logos and frankings, and differ only in the area of the recipient address. Distinguishing them therefore requires the use of address features, e.g. by carrying out complex word recognition.
- the method should be optimally suited to sorting postal bulk mail.
- a text block offers a great deal of potential for description by means of suitable features, so that an associated feature data record can be generated which unambiguously characterizes it and differentiates it from other text blocks.
- no content-related interpretation of the text block and thus no comparison should be made on the basis of the literal text content.
- features of a text block that relate to graphic properties of the entire text block are much easier and faster to extract than an interpretation of the text. Examples are the size of the text block, the location of the text block within the printed product, the degree of filling of the text block, the number of lines in the text block, the size of the gaps between lines in the text block and/or the font size of the lines in the text block.
- one or more fine structure related features of the text block may be extracted, which now refer to graphic properties of individual lines in the text block.
- the features used here can be selected from the following: number of connected areas within a line, frequency of connected areas, color value transitions in a line (possibly in matrix form for several lines) and/or line profiles.
- feature vectors are used as feature data records; in the identification process they are used for sorting or comparing, e.g., two text blocks that are to be retrieved.
- the structure-related features of a text block of a printed product are arranged in a feature data record such that a comparison between two features of the same category remains feasible.
- the feature data sets are compared with one another according to their assignment in order to identify the text blocks as a function of the coarse or possibly the fine classification.
- FIG. 3B shows a detection of the same address field on a new postal item with a missing line
- FIG. 3C shows a reassignment of lines.
- FIG. 1 shows what is understood, in a feature data set, by a line and a line spacing when an address field shown in full (top) is decomposed into three lines 1, 2, 3 (bottom).
- the font size, e.g., the largest letter of the line
- with these features, a rough analysis or classification of the similarity of two texts can be carried out. The features are easy, fast and reliable to detect and have negligible memory requirements.
- Text blocks that are similar based on these criteria are then examined for their similarity by more complex procedures. For this purpose, the structure of a text on the one hand and the text lines that occur on the other are examined in more detail. With the help of the detected lines, the following features can be determined in a second, finer classification:
- the first features described can be studied using simple distance measures and decision-making procedures. Line profiles, however, require a more complex distance measure, since the vectors are heavily dependent on the detected text block. Slight shifts lead to changes in the feature data record. In order to determine the distance, a distance measure is therefore needed which takes into account the influence of such displacements.
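The excerpt above calls for a distance measure on line profiles that tolerates slight shifts, but it does not say which measure is used. The following is only a minimal sketch of one possible choice, not the method of the patent: the profiles are compared at several small offsets and the smallest mean absolute difference is kept. The function name `shift_tolerant_distance` and the parameter `max_shift` are assumptions introduced for illustration.

```python
from typing import Sequence


def shift_tolerant_distance(
    p: Sequence[float], q: Sequence[float], max_shift: int = 2
) -> float:
    """Smallest mean absolute difference between two profiles over small shifts.

    Hypothetical measure: for each shift s in [-max_shift, max_shift], p[i] is
    paired with q[i + s] and only the overlapping part is compared.
    """
    best = float("inf")
    for s in range(-max_shift, max_shift + 1):
        # Keep only index pairs that are valid in both profiles.
        pairs = [(p[i], q[i + s]) for i in range(len(p)) if 0 <= i + s < len(q)]
        if not pairs:
            continue
        d = sum(abs(a - b) for a, b in pairs) / len(pairs)
        best = min(best, d)
    return best


# The second profile is the first one displaced by one position.
profile_a = [0, 0, 3, 5, 3, 0, 0]
profile_b = [0, 3, 5, 3, 0, 0, 0]
print(shift_tolerant_distance(profile_a, profile_b, max_shift=0))  # plain comparison
print(shift_tolerant_distance(profile_a, profile_b, max_shift=2))  # shift-tolerant
```

Because only the overlap is compared, `max_shift` should stay small relative to the profile length; otherwise a tiny overlap could produce a misleadingly small distance.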
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Character Discrimination (AREA)
- Sorting Of Articles (AREA)
Abstract
The invention relates to a method for retrieving text blocks in documents, preferably for postal items to be sorted, such as bulk mailings. The aim of the invention is to retrieve or identify these text blocks in documents of any type using feature data records of reference text blocks. To this end, structure-related features of the text block are extracted as characteristic features and compared with the features of a feature data record of a reference text block, so that similar features can be recognized as simply as possible among several text blocks. A first extraction of structure-related features can be carried out, for example, by decomposing a text block into several lines whose heights or line spacings are stored in the feature data record of a postal item. Similarities between different text blocks can thus be found by comparing the feature data records.
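The line decomposition described in the abstract can be illustrated with a short sketch. It assumes the text block is already available as a binary image (rows of 0/1 pixels), treats every row containing ink as part of a line, and records line heights and the gaps between lines. The names `LineFeatures`, `decompose_into_lines` and `records_similar`, as well as the pixel tolerance, are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LineFeatures:
    """Hypothetical feature data record: line heights and gaps, in pixel rows."""
    heights: List[int]
    gaps: List[int]


def decompose_into_lines(block: Sequence[Sequence[int]]) -> LineFeatures:
    """Split a binary text block (1 = ink, 0 = background) into lines.

    Rows with at least one ink pixel belong to a line; runs of empty rows
    between two lines are gaps. Leading and trailing empty rows are ignored.
    """
    heights: List[int] = []
    gaps: List[int] = []
    run_ink = 0
    run_empty = 0
    for row in block:
        if any(row):
            if run_ink == 0 and heights:
                # A new line starts after at least one earlier line: store the gap.
                gaps.append(run_empty)
            run_ink += 1
            run_empty = 0
        else:
            if run_ink:
                heights.append(run_ink)
                run_ink = 0
            run_empty += 1
    if run_ink:
        heights.append(run_ink)
    return LineFeatures(heights, gaps)


def records_similar(a: LineFeatures, b: LineFeatures, tolerance: int = 1) -> bool:
    """Compare features of the same category (height with height, gap with gap)."""
    if len(a.heights) != len(b.heights):
        return False  # simplification: a missing line is not handled here
    pairs = list(zip(a.heights, b.heights)) + list(zip(a.gaps, b.gaps))
    return all(abs(x - y) <= tolerance for x, y in pairs)


block = [[0, 0], [1, 0], [1, 1], [0, 0], [0, 0], [0, 1], [0, 0]]
record = decompose_into_lines(block)
print(record)
print(records_similar(record, LineFeatures(heights=[3, 1], gaps=[2])))
```

Rejecting records with a different number of lines is a deliberate simplification of this sketch; handling a missing line, as in the case FIG. 3B refers to, would need an additional line-assignment step.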
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE102005040687A DE102005040687A1 (de) | 2005-08-26 | 2005-08-26 | Verfahren zum Wiederauffinden von Textblöcken in Dokumenten |
PCT/EP2006/007939 WO2007022877A1 (fr) | 2005-08-26 | 2006-08-11 | Procede pour retrouver des blocs de texte dans des documents |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1917626A1 (fr) | 2008-05-07 |
Family
ID=37398939
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP06776758A Withdrawn EP1917626A1 (fr) | 2005-08-26 | 2006-08-11 | Procede pour retrouver des blocs de texte dans des documents |
Country Status (6)
Country | Link |
---|---|
US (1) | US20090252415A1 (fr) |
EP (1) | EP1917626A1 (fr) |
CN (1) | CN101263512A (fr) |
CA (1) | CA2620180A1 (fr) |
DE (1) | DE102005040687A1 (fr) |
WO (1) | WO2007022877A1 (fr) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102331982B (zh) * | 2011-07-28 | 2014-03-05 | 深圳万兴信息科技股份有限公司 | 自适应窗体大小的pdf文档显示方法、系统及移动终端 |
US10095946B2 (en) * | 2016-07-07 | 2018-10-09 | Lockheed Martin Corporation | Systems and methods for strike through detection |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5848184A (en) * | 1993-03-15 | 1998-12-08 | Unisys Corporation | Document page analyzer and method |
EP0702322B1 (fr) * | 1994-09-12 | 2002-02-13 | Adobe Systems Inc. | Méthode et appareil pour identifier des mots décrits dans un document électronique portable |
US5995659A (en) | 1997-09-09 | 1999-11-30 | Siemens Corporate Research, Inc. | Method of searching and extracting text information from drawings |
JPH11238097A (ja) * | 1998-02-20 | 1999-08-31 | Toshiba Corp | 郵便物宛先読取装置及び宛先読取方法 |
JP2004326491A (ja) * | 2003-04-25 | 2004-11-18 | Canon Inc | 画像処理方法 |
JP4855698B2 (ja) * | 2005-03-22 | 2012-01-18 | 株式会社東芝 | 宛先認識装置 |
-
2005
- 2005-08-26 DE DE102005040687A patent/DE102005040687A1/de not_active Withdrawn
-
2006
- 2006-08-11 EP EP06776758A patent/EP1917626A1/fr not_active Withdrawn
- 2006-08-11 CA CA002620180A patent/CA2620180A1/fr not_active Abandoned
- 2006-08-11 WO PCT/EP2006/007939 patent/WO2007022877A1/fr active Application Filing
- 2006-08-11 CN CNA2006800311292A patent/CN101263512A/zh active Pending
- 2006-08-11 US US11/991,058 patent/US20090252415A1/en not_active Abandoned
Non-Patent Citations (1)
Title |
---|
See references of WO2007022877A1 * |
Also Published As
Publication number | Publication date |
---|---|
CN101263512A (zh) | 2008-09-10 |
CA2620180A1 (fr) | 2007-03-01 |
DE102005040687A1 (de) | 2007-03-01 |
WO2007022877A1 (fr) | 2007-03-01 |
US20090252415A1 (en) | 2009-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1665132B1 (fr) | Procede et systeme de detection de donnees provenant de plusieurs documents lisibles par ordinateur | |
DE60308025T3 (de) | Identifikationsmarkieren von poststücken durch bildsignatur und zugehörige postbearbeitungsmaschine | |
DE69814104T2 (de) | Aufteilung von texten und identifizierung von themen | |
DE10027178B4 (de) | Magnetstreifen-Authentifizierungs-Verifizierungssystem | |
EP0980293B1 (fr) | Procede et dispositif de reconnaissance d'information de distribution de courrier | |
DE19705757A1 (de) | Verfahren und Gerät für das Design eines hoch-zuverlässigen Mustererkennungs-Systems | |
DE19511470C1 (de) | Verfahren zur Ermittlung eines Referenzschriftzuges anhand einer Menge von schreiberidentischen Musterschriftzügen | |
DE69926280T2 (de) | Verfahren zur Erkennung von Adressen und Briefverarbeitungsvorrichtung | |
WO1999016559A1 (fr) | Procede et dispositif de reconnaissance des donnees de distribution figurant sur les envois postaux | |
DE2435889B2 (de) | Verfahren und einrichtung zur unterscheidung von zeichengruppen | |
WO2007022880A1 (fr) | Procede d'identification d'envois a trier | |
EP1917626A1 (fr) | Procede pour retrouver des blocs de texte dans des documents | |
EP2259210A2 (fr) | Procédé et dispositif destinés à l'analyse d'une base de données | |
DE102006008936A1 (de) | Verfahren zum Erkennen von Objekten und Objekterkennungssystem | |
EP2273383A1 (fr) | Procédé et dispositif de recherche automatique de documents dans un dispositif de stockage de données | |
EP2084652A1 (fr) | Procédé et dispositif d'identification d'objets | |
EP1389493A1 (fr) | Procédé et dispositif pour le marquage automatique d'un champ d'adresse | |
DE3414455A1 (de) | Verfahren und vorrichtung zum lesen und speichern von information | |
EP1159705B1 (fr) | Procede de lecture d'entrees de documents et d'adresses | |
DE19820353C2 (de) | Verfahren und Vorrichtung zur Erkennung eines Musters auf einer Vorlage | |
EP0731955B1 (fr) | Procede et dispositif de saisie et d'identification automatique d'informations enregistrees | |
WO2003079273A2 (fr) | Procede et dispositif de lecture d'adresses d'envois | |
DE102009050681A1 (de) | Verfahren und Vorrichtung zum Erkennen und Klassifizieren von Dokumentteilen eines rechnerverfügbaren Dokuments durch schrittweises Lernen aus mehreren Trainingsmengen | |
DE19635351C2 (de) | Verfahren zur Formatkonvertierung | |
DE102009013390A1 (de) | Verfahren und Vorrichtung zum Klassifizieren eines physikalischen Objekts mittels eines parametrierten Klassifikators |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under Article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20080226 |
AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR |
17Q | First examination report despatched | Effective date: 20081001 |
STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
18D | Application deemed to be withdrawn | Effective date: 20090415 |