EP1679613A3 - Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents - Google Patents

Info

Publication number
EP1679613A3
Authority
EP
European Patent Office
Prior art keywords
text
footer
header
textual
variability
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP20060100200
Other languages
German (de)
French (fr)
Other versions
EP1679613A2 (en)
Inventor
Hervé Dejean
Jean-Luc Meunier
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xerox Corp
Original Assignee
Xerox Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2005-01-10
Filing date
2006-01-10
Publication date
2009-10-14
Application filed by Xerox Corp
Publication of EP1679613A2
Publication of EP1679613A3
Current legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/114 Pagination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method is provided for identifying header/footer content of a document in order to sequence text fragments comprising recognizable text blocks derived from the document. The textual variability of lines composed of text blocks is assessed, taking into account the different kinds of text blocks within each line. Header/footer zones are defined by textual content having low textual variability. An alternative embodiment identifies pagination constructs by comparing selected text boxes for similarity and proximity and clustering the text boxes that satisfy a predetermined similarity value; the clustered text boxes are deemed to comprise pagination constructs.
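The low-variability idea in the abstract can be illustrated with a minimal Python sketch. This is not the claimed method: the scoring formula (distinct normalized texts divided by the number of pages), the digit normalization, the 0.5 threshold, and the names `normalize`, `variability` and `header_footer_positions` are all assumptions made for illustration. Each page is modeled simply as a list of text lines.

```python
import re
from typing import List, Tuple


def normalize(text: str) -> str:
    """Collapse digit runs so 'Page 3' and 'Page 12' count as the same kind of text.

    Assumption: digit normalization is one plausible way to treat page numbers
    as a single kind of text block.
    """
    return re.sub(r"\d+", "#", text.strip())


def variability(pages: List[List[str]], position: int) -> float:
    """Distinct normalized texts at a line position / pages that have that position.

    Negative positions count from the bottom of the page.
    """
    texts = [
        normalize(page[position])
        for page in pages
        if -len(page) <= position < len(page)
    ]
    if not texts:
        return 1.0  # no evidence: treat as fully variable
    return len(set(texts)) / len(texts)


def header_footer_positions(
    pages: List[List[str]], max_lines: int = 3, threshold: float = 0.5
) -> Tuple[List[int], List[int]]:
    """Grow header/footer zones from the page edges while variability stays low."""
    headers: List[int] = []
    for pos in range(max_lines):
        if variability(pages, pos) > threshold:
            break
        headers.append(pos)

    footers: List[int] = []
    for pos in range(-1, -max_lines - 1, -1):
        if variability(pages, pos) > threshold:
            break
        footers.append(pos)
    return headers, footers


# Hypothetical three-page document used only to exercise the sketch.
pages = [
    ["Annual Report 2004", "Introduction to the topic", "Body text alpha", "Page 1"],
    ["Annual Report 2004", "More discussion follows", "Body text beta", "Page 2"],
    ["Annual Report 2004", "Closing remarks here", "Body text gamma", "Page 3"],
]
print(header_footer_positions(pages))
```

In this toy input the first and last lines repeat on every page once digits are normalized, so they score low and are treated as header and footer zones, while the body lines differ on every page and score high. The alternative embodiment in the abstract (clustering text boxes by similarity and proximity) is not covered by this sketch.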

Applications Claiming Priority (1)

Application Number Publication Priority Date Filing Date Title
US11/032,817 US7937653B2 (en) 2005-01-10 2005-01-10 Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents

Publications (2)

Publication Number Publication Date
EP1679613A2 (en) 2006-07-12
EP1679613A3 (en) 2009-10-14

Family

ID=36218159

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
EP20060100200 Ceased EP1679613A3 (en) 2005-01-10 2006-01-10 Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents

Country Status (3)

Country Link
US (2) US7937653B2 (en)
EP (1) EP1679613A3 (en)
JP (1) JP4974529B2 (en)

Families Citing this family (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645821B2 (en) 2010-09-28 2014-02-04 Xerox Corporation System and method for page frame detection
US9110868B2 (en) * 2005-01-10 2015-08-18 Xerox Corporation System and method for logical structuring of documents based on trailing and leading pages
US7937653B2 (en) 2005-01-10 2011-05-03 Xerox Corporation Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
US7584424B2 (en) * 2005-08-19 2009-09-01 Vista Print Technologies Limited Automated product layout
US7676744B2 (en) * 2005-08-19 2010-03-09 Vistaprint Technologies Limited Automated markup language layout
US7797622B2 (en) * 2006-11-15 2010-09-14 Xerox Corporation Versatile page number detector
JP4436851B2 (en) * 2007-05-22 2010-03-24 シャープ株式会社 Printer driver program and image forming apparatus
US8023740B2 (en) * 2007-08-13 2011-09-20 Xerox Corporation Systems and methods for notes detection
US9224041B2 (en) * 2007-10-25 2015-12-29 Xerox Corporation Table of contents extraction based on textual similarity and formal aspects
US9135249B2 (en) * 2009-05-29 2015-09-15 Xerox Corporation Number sequences detection systems and methods
US8739030B2 (en) * 2010-03-10 2014-05-27 Salesforce.Com, Inc. Providing a quote template in a multi-tenant database system environment
US8606789B2 (en) 2010-07-02 2013-12-10 Xerox Corporation Method for layout based document zone querying
US9218322B2 (en) * 2010-07-28 2015-12-22 Hewlett-Packard Development Company, L.P. Producing web page content
US8340425B2 (en) 2010-08-10 2012-12-25 Xerox Corporation Optical character recognition with two-pass zoning
US8798366B1 (en) * 2010-12-28 2014-08-05 Amazon Technologies, Inc. Electronic book pagination
US9069767B1 (en) 2010-12-28 2015-06-30 Amazon Technologies, Inc. Aligning content items to identify differences
US9846688B1 (en) 2010-12-28 2017-12-19 Amazon Technologies, Inc. Book version mapping
US9881009B1 (en) 2011-03-15 2018-01-30 Amazon Technologies, Inc. Identifying book title sets
US8560937B2 (en) 2011-06-07 2013-10-15 Xerox Corporation Generate-and-test method for column segmentation
US8645819B2 (en) 2011-06-17 2014-02-04 Xerox Corporation Detection and extraction of elements constituting images in unstructured document files
US10540426B2 (en) * 2011-07-11 2020-01-21 Paper Software LLC System and method for processing document
WO2013013335A1 (en) * 2011-07-22 2013-01-31 Hewlett-Packard Development Company, L.P. Automated document composition using clusters
US8478046B2 (en) 2011-11-03 2013-07-02 Xerox Corporation Signature mark detection
EP2807602A1 (en) * 2012-01-23 2014-12-03 Microsoft Corporation Pattern matching engine
US9524274B2 (en) 2013-06-06 2016-12-20 Xerox Corporation Methods and systems for generation of document structures based on sequential constraints
US10803233B2 (en) 2012-05-31 2020-10-13 Conduent Business Services Llc Method and system of extracting structured data from a document
US9008443B2 (en) 2012-06-22 2015-04-14 Xerox Corporation System and method for identifying regular geometric structures in document pages
US8830487B2 (en) 2012-07-09 2014-09-09 Xerox Corporation System and method for separating image and text in a document
US8812870B2 (en) 2012-10-10 2014-08-19 Xerox Corporation Confidentiality preserving document analysis system and method
US9008425B2 (en) 2013-01-29 2015-04-14 Xerox Corporation Detection of numbered captions
US20140230075A1 (en) 2013-02-08 2014-08-14 Xerox Corporation Physical and electronic book reconciliation
US9672195B2 (en) 2013-12-24 2017-06-06 Xerox Corporation Method and system for page construct detection based on sequential regularities
US10303745B2 (en) 2014-06-16 2019-05-28 Hewlett-Packard Development Company, L.P. Pagination point identification
RU2604668C2 (en) * 2014-06-17 2016-12-10 Общество с ограниченной ответственностью "Аби Девелопмент" Rendering computer-generated document image
JP6063964B2 (en) * 2015-01-15 2017-01-18 京セラドキュメントソリューションズ株式会社 Image processing apparatus and image forming apparatus
US9530070B2 (en) 2015-04-29 2016-12-27 Procore Technologies, Inc. Text parsing in complex graphical images
US10997362B2 (en) * 2016-09-01 2021-05-04 Wacom Co., Ltd. Method and system for input areas in documents for handwriting devices
US11309075B2 (en) * 2016-12-29 2022-04-19 Cerner Innovation, Inc. Generation of a transaction set
US11475209B2 (en) 2017-10-17 2022-10-18 Handycontract Llc Device, system, and method for extracting named entities from sectioned documents
WO2019077405A1 (en) * 2017-10-17 2019-04-25 Handycontract, LLC Method, device, and system, for identifying data elements in data structures
US10650186B2 (en) 2018-06-08 2020-05-12 Handycontract, LLC Device, system and method for displaying sectioned documents
US10534846B1 (en) * 2018-10-17 2020-01-14 Pricewaterhousecoopers Llp Page stream segmentation
KR102345625B1 (en) * 2019-02-01 2021-12-31 삼성전자주식회사 Caption generation method and apparatus for performing the same
CN110543810A (en) * 2019-06-28 2019-12-06 南京智录信息科技有限公司 Technology for completely identifying header and footer of PDF (Portable document Format) file
JP7396032B2 (en) * 2019-12-24 2023-12-12 京セラドキュメントソリューションズ株式会社 Image forming device
US20210319180A1 (en) 2020-01-24 2021-10-14 Thomson Reuters Enterprise Centre Gmbh Systems and methods for deviation detection, information extraction and obligation deviation detection
CN112036132B (en) * 2020-09-01 2024-04-19 珠海豹趣科技有限公司 Method and device for editing header and footer of document and electronic equipment
WO2022164796A1 (en) 2021-01-26 2022-08-04 California Institute Of Technology Allosteric conditional guide rnas for cell-selective regulation of crispr/cas
CN113065154B (en) * 2021-03-19 2023-12-29 深信服科技股份有限公司 Document detection method, device, equipment and storage medium
CN117058704B (en) * 2023-09-15 2024-01-05 之江实验室 Teaching material content and structure extraction method and device based on visual and text characteristics

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5434962A (en) * 1990-09-07 1995-07-18 Fuji Xerox Co., Ltd. Method and system for automatically generating logical structures of electronic documents
JPH06214983A (en) * 1993-01-20 1994-08-05 Kokusai Denshin Denwa Co Ltd <Kdd> Method and device for converting document picture to logical structuring document
US5491628A (en) * 1993-12-10 1996-02-13 Xerox Corporation Method and apparatus for document transformation based on attribute grammars and attribute couplings
US6470306B1 (en) * 1996-04-23 2002-10-22 Logovista Corporation Automated translation of annotated text based on the determination of locations for inserting annotation tokens and linked ending, end-of-sentence or language tokens
US6298357B1 (en) * 1997-06-03 2001-10-02 Adobe Systems Incorporated Structure extraction on electronic documents
US6353840B2 (en) * 1997-08-15 2002-03-05 Ricoh Company, Ltd. User-defined search template for extracting information from documents
JPH1166196A (en) * 1997-08-15 1999-03-09 Ricoh Co Ltd Document image recognition device and computer-readable recording medium where program allowing computer to function as same device is recorded
IE980959A1 (en) * 1998-03-31 1999-10-20 Datapage Ireland Ltd Document Production
US6487566B1 (en) * 1998-10-05 2002-11-26 International Business Machines Corporation Transforming documents using pattern matching and a replacement language
US20020143818A1 (en) * 2001-03-30 2002-10-03 Roberts Elizabeth A. System for generating a structured document
JP2003150586A (en) * 2001-11-12 2003-05-23 Ntt Docomo Inc Document converting system, document converting method and computer-readable recording medium with document converting program recorded thereon
US20040205568A1 (en) * 2002-03-01 2004-10-14 Breuel Thomas M. Method and system for document image layout deconstruction and redisplay system
US6907431B2 (en) * 2002-05-03 2005-06-14 Hewlett-Packard Development Company, L.P. Method for determining a logical structure of a document
JP2004178010A (en) * 2002-11-22 2004-06-24 Toshiba Corp Document processor, its method, and program
US7310773B2 (en) * 2003-01-13 2007-12-18 Hewlett-Packard Development Company, L.P. Removal of extraneous text from electronic documents
US7165216B2 (en) * 2004-01-14 2007-01-16 Xerox Corporation Systems and methods for converting legacy and proprietary documents into extended mark-up language format
US8495061B1 (en) * 2004-09-29 2013-07-23 Google Inc. Automatic metadata identification
US7440967B2 (en) * 2004-11-10 2008-10-21 Xerox Corporation System and method for transforming legacy documents into XML documents
US7937653B2 (en) 2005-01-10 2011-05-03 Xerox Corporation Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
US7693848B2 (en) 2005-01-10 2010-04-06 Xerox Corporation Method and apparatus for structuring documents based on layout, content and collection
US8706475B2 (en) * 2005-01-10 2014-04-22 Xerox Corporation Method and apparatus for detecting a table of contents and reference determination
JP4789516B2 (en) * 2005-06-14 2011-10-12 キヤノン株式会社 Document conversion apparatus, document conversion method, and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Document Recognition and Retrieval X (Proceedings of SPIE/IS&T): Header and Footer Extraction by Page-Association from Xiaofan Lin", IMAGING ONLINE PUBLICATIONS CATALOG, vol. 5010, XP002543908, Retrieved from the Internet <URL:http://www.imaging.org/ScriptContent/store/physpub.cfm?seriesid=24&pubid=571> [retrieved on 20090623] *
HERVÉ DÉJEAN, JEAN-LUC MEUNIER: "Versatile Page Numbering Analysis", XEROX PUBLICATIONS, XP002533713, Retrieved from the Internet <URL:http://www.xrce.xerox.com/Publications/Attachments/2007-031/2007-031.pdf> [retrieved on 20090624] *
XIAOFAN LIN: "Header and Footer Extraction by Page-Association", 2002, pages 1 - 8, XP002533579, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.6211> [retrieved on 20090623] *
XIAOFAN LIN: "Header and Footer Extraction by Page-Association", CITESEERX ABSTRACT, 2002, XP002543906, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.6211> [retrieved on 20090623] *
XIAOFAN LIN: "Header and Footer Extraction by Page-Association", DOCUMENT RECOGNITION AND RETRIEVAL X- DRR 2003: SANTA CLARA, CALIFORNIA, USA (LAYOUT ANALYSIS), Universität Trier, pages 1 - 3, XP002543907, Retrieved from the Internet <URL:http://www.informatik.uni-trier.de/~ley/db/conf/drr/drr2003.html#Lin03> [retrieved on 20090623] *
XIAOFAN LIN: "Header and Footer Extraction by Page-Association", HP LABS TECHNICAL REPORTS, 26 December 2002 (2002-12-26), XP002533592, Retrieved from the Internet <URL:http://web.archive.org/web/20021226015423/http://www.hpl.hp.com/techreports/2002/HPL-2002-129.html> [retrieved on 20090623] *

Also Published As

Publication number Publication date
US7937653B2 (en) 2011-05-03
JP2006195980A (en) 2006-07-27
US20110145701A1 (en) 2011-06-16
US20060156226A1 (en) 2006-07-13
EP1679613A2 (en) 2006-07-12
JP4974529B2 (en) 2012-07-11
US9218326B2 (en) 2015-12-22

Similar Documents

Publication Publication Date Title
EP1679613A3 (en) Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
EP1679623A3 (en) Method and apparatus for detecting a table of contents and reference determination
EP1669896A3 (en) A machine learning system for extracting structured records from web pages and other text sources
EP1615154A3 (en) Method and software for extracting chemical data
Johnston BSL, Auslan and NZSL: three signed languages or one?
EP1589443A3 (en) Method, system or memory storing a computer program for document processing
WO2007038389A3 (en) Method and apparatus for identifying and classifying network documents as spam
EP1679625A3 (en) Method and apparatus for structuring documents based on layout, content and collection
EP1635268A3 (en) Freeform digital ink annotation recognition
WO2007130544A3 (en) Method for domain identification of documents in a document database
GB2448275A (en) Document analysis system for integration of paper records into a searchable electronic database
EP1962208A3 (en) System and method for searching annotated document collections
WO2009098468A3 (en) A method and system of indexing numerical data
WO2008100849A3 (en) Semantics-based method and system for document analysis
EP2444920A3 (en) Detection of duplicate document content using two-dimensional visual fingerprinting
EP1736901A3 (en) Method for classifying sub-trees in semi-structured documents
EP2634709A3 (en) System and method for appending security information to search engine results
EP1986160A3 (en) Document processing system control using document feature analysis for identification
WO2006132793A3 (en) Learning facts from semi-structured text
WO2007059232A3 (en) Methods and apparatus for probe-based clustering
EP1909194A4 (en) Information processing device, feature extraction method, recording medium, and program
EP1233349A3 (en) Data display method and apparatus for use in text mining
EP1739573A3 (en) Probabilistic learning method for XML annotation of documents
JP2002245070A5 (en)
EP1965312A3 (en) Information processing apparatus and method, program, and storage medium

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA HR MK YU

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA HR MK YU

17P Request for examination filed

Effective date: 20100414

AKX Designation fees paid

Designated state(s): DE FR GB

17Q First examination report despatched

Effective date: 20100601

APBK Appeal reference recorded

Free format text: ORIGINAL CODE: EPIDOSNREFNE

APBN Date of receipt of notice of appeal recorded

Free format text: ORIGINAL CODE: EPIDOSNNOA2E

APBR Date of receipt of statement of grounds of appeal recorded

Free format text: ORIGINAL CODE: EPIDOSNNOA3E

APAF Appeal reference modified

Free format text: ORIGINAL CODE: EPIDOSCREFNE

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: XEROX CORPORATION

APAF Appeal reference modified

Free format text: ORIGINAL CODE: EPIDOSCREFNE

APBT Appeal procedure closed

Free format text: ORIGINAL CODE: EPIDOSNNOA9E

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20161005