EP1679613A3 - Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
- Publication number
- EP1679613A3 (application number EP20060100200)
- Authority
- EP
- European Patent Office
- Prior art keywords
- text
- footer
- header
- textual
- variability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/114—Pagination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/032,817 US7937653B2 (en) | 2005-01-10 | 2005-01-10 | Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1679613A2 (en) | 2006-07-12 |
EP1679613A3 (en) | 2009-10-14 |
Family
ID=36218159
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP20060100200 Ceased EP1679613A3 (en) | 2005-01-10 | 2006-01-10 | Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents |
Country Status (3)
Country | Link |
---|---|
US (2) | US7937653B2 (en) |
EP (1) | EP1679613A3 (en) |
JP (1) | JP4974529B2 (en) |
Families Citing this family (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8645821B2 (en) | 2010-09-28 | 2014-02-04 | Xerox Corporation | System and method for page frame detection |
US9110868B2 (en) * | 2005-01-10 | 2015-08-18 | Xerox Corporation | System and method for logical structuring of documents based on trailing and leading pages |
US7937653B2 (en) | 2005-01-10 | 2011-05-03 | Xerox Corporation | Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents |
US7584424B2 (en) * | 2005-08-19 | 2009-09-01 | Vista Print Technologies Limited | Automated product layout |
US7676744B2 (en) * | 2005-08-19 | 2010-03-09 | Vistaprint Technologies Limited | Automated markup language layout |
US7797622B2 (en) * | 2006-11-15 | 2010-09-14 | Xerox Corporation | Versatile page number detector |
JP4436851B2 (en) * | 2007-05-22 | 2010-03-24 | シャープ株式会社 | Printer driver program and image forming apparatus |
US8023740B2 (en) * | 2007-08-13 | 2011-09-20 | Xerox Corporation | Systems and methods for notes detection |
US9224041B2 (en) * | 2007-10-25 | 2015-12-29 | Xerox Corporation | Table of contents extraction based on textual similarity and formal aspects |
US9135249B2 (en) * | 2009-05-29 | 2015-09-15 | Xerox Corporation | Number sequences detection systems and methods |
US8739030B2 (en) * | 2010-03-10 | 2014-05-27 | Salesforce.Com, Inc. | Providing a quote template in a multi-tenant database system environment |
US8606789B2 (en) | 2010-07-02 | 2013-12-10 | Xerox Corporation | Method for layout based document zone querying |
US9218322B2 (en) * | 2010-07-28 | 2015-12-22 | Hewlett-Packard Development Company, L.P. | Producing web page content |
US8340425B2 (en) | 2010-08-10 | 2012-12-25 | Xerox Corporation | Optical character recognition with two-pass zoning |
US8798366B1 (en) * | 2010-12-28 | 2014-08-05 | Amazon Technologies, Inc. | Electronic book pagination |
US9069767B1 (en) | 2010-12-28 | 2015-06-30 | Amazon Technologies, Inc. | Aligning content items to identify differences |
US9846688B1 (en) | 2010-12-28 | 2017-12-19 | Amazon Technologies, Inc. | Book version mapping |
US9881009B1 (en) | 2011-03-15 | 2018-01-30 | Amazon Technologies, Inc. | Identifying book title sets |
US8560937B2 (en) | 2011-06-07 | 2013-10-15 | Xerox Corporation | Generate-and-test method for column segmentation |
US8645819B2 (en) | 2011-06-17 | 2014-02-04 | Xerox Corporation | Detection and extraction of elements constituting images in unstructured document files |
US10540426B2 (en) * | 2011-07-11 | 2020-01-21 | Paper Software LLC | System and method for processing document |
WO2013013335A1 (en) * | 2011-07-22 | 2013-01-31 | Hewlett-Packard Development Company, L.P. | Automated document composition using clusters |
US8478046B2 (en) | 2011-11-03 | 2013-07-02 | Xerox Corporation | Signature mark detection |
EP2807602A1 (en) * | 2012-01-23 | 2014-12-03 | Microsoft Corporation | Pattern matching engine |
US9524274B2 (en) | 2013-06-06 | 2016-12-20 | Xerox Corporation | Methods and systems for generation of document structures based on sequential constraints |
US10803233B2 (en) | 2012-05-31 | 2020-10-13 | Conduent Business Services Llc | Method and system of extracting structured data from a document |
US9008443B2 (en) | 2012-06-22 | 2015-04-14 | Xerox Corporation | System and method for identifying regular geometric structures in document pages |
US8830487B2 (en) | 2012-07-09 | 2014-09-09 | Xerox Corporation | System and method for separating image and text in a document |
US8812870B2 (en) | 2012-10-10 | 2014-08-19 | Xerox Corporation | Confidentiality preserving document analysis system and method |
US9008425B2 (en) | 2013-01-29 | 2015-04-14 | Xerox Corporation | Detection of numbered captions |
US20140230075A1 (en) | 2013-02-08 | 2014-08-14 | Xerox Corporation | Physical and electronic book reconciliation |
US9672195B2 (en) | 2013-12-24 | 2017-06-06 | Xerox Corporation | Method and system for page construct detection based on sequential regularities |
US10303745B2 (en) | 2014-06-16 | 2019-05-28 | Hewlett-Packard Development Company, L.P. | Pagination point identification |
RU2604668C2 (en) * | 2014-06-17 | 2016-12-10 | Общество с ограниченной ответственностью "Аби Девелопмент" | Rendering computer-generated document image |
JP6063964B2 (en) * | 2015-01-15 | 2017-01-18 | 京セラドキュメントソリューションズ株式会社 | Image processing apparatus and image forming apparatus |
US9530070B2 (en) | 2015-04-29 | 2016-12-27 | Procore Technologies, Inc. | Text parsing in complex graphical images |
US10997362B2 (en) * | 2016-09-01 | 2021-05-04 | Wacom Co., Ltd. | Method and system for input areas in documents for handwriting devices |
US11309075B2 (en) * | 2016-12-29 | 2022-04-19 | Cerner Innovation, Inc. | Generation of a transaction set |
US11475209B2 (en) | 2017-10-17 | 2022-10-18 | Handycontract Llc | Device, system, and method for extracting named entities from sectioned documents |
WO2019077405A1 (en) * | 2017-10-17 | 2019-04-25 | Handycontract, LLC | Method, device, and system, for identifying data elements in data structures |
US10650186B2 (en) | 2018-06-08 | 2020-05-12 | Handycontract, LLC | Device, system and method for displaying sectioned documents |
US10534846B1 (en) * | 2018-10-17 | 2020-01-14 | Pricewaterhousecoopers Llp | Page stream segmentation |
KR102345625B1 (en) * | 2019-02-01 | 2021-12-31 | 삼성전자주식회사 | Caption generation method and apparatus for performing the same |
CN110543810A (en) * | 2019-06-28 | 2019-12-06 | 南京智录信息科技有限公司 | Technology for completely identifying header and footer of PDF (Portable document Format) file |
JP7396032B2 (en) * | 2019-12-24 | 2023-12-12 | 京セラドキュメントソリューションズ株式会社 | Image forming device |
US20210319180A1 (en) | 2020-01-24 | 2021-10-14 | Thomson Reuters Enterprise Centre Gmbh | Systems and methods for deviation detection, information extraction and obligation deviation detection |
CN112036132B (en) * | 2020-09-01 | 2024-04-19 | 珠海豹趣科技有限公司 | Method and device for editing header and footer of document and electronic equipment |
WO2022164796A1 (en) | 2021-01-26 | 2022-08-04 | California Institute Of Technology | Allosteric conditional guide rnas for cell-selective regulation of crispr/cas |
CN113065154B (en) * | 2021-03-19 | 2023-12-29 | 深信服科技股份有限公司 | Document detection method, device, equipment and storage medium |
CN117058704B (en) * | 2023-09-15 | 2024-01-05 | 之江实验室 | Teaching material content and structure extraction method and device based on visual and text characteristics |
Family Cites Families (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5434962A (en) * | 1990-09-07 | 1995-07-18 | Fuji Xerox Co., Ltd. | Method and system for automatically generating logical structures of electronic documents |
JPH06214983A (en) * | 1993-01-20 | 1994-08-05 | Kokusai Denshin Denwa Co Ltd <Kdd> | Method and device for converting document picture to logical structuring document |
US5491628A (en) * | 1993-12-10 | 1996-02-13 | Xerox Corporation | Method and apparatus for document transformation based on attribute grammars and attribute couplings |
US6470306B1 (en) * | 1996-04-23 | 2002-10-22 | Logovista Corporation | Automated translation of annotated text based on the determination of locations for inserting annotation tokens and linked ending, end-of-sentence or language tokens |
US6298357B1 (en) * | 1997-06-03 | 2001-10-02 | Adobe Systems Incorporated | Structure extraction on electronic documents |
US6353840B2 (en) * | 1997-08-15 | 2002-03-05 | Ricoh Company, Ltd. | User-defined search template for extracting information from documents |
JPH1166196A (en) * | 1997-08-15 | 1999-03-09 | Ricoh Co Ltd | Document image recognition device and computer-readable recording medium where program allowing computer to function as same device is recorded |
IE980959A1 (en) * | 1998-03-31 | 1999-10-20 | Datapage Ireland Ltd | Document Production |
US6487566B1 (en) * | 1998-10-05 | 2002-11-26 | International Business Machines Corporation | Transforming documents using pattern matching and a replacement language |
US20020143818A1 (en) * | 2001-03-30 | 2002-10-03 | Roberts Elizabeth A. | System for generating a structured document |
JP2003150586A (en) * | 2001-11-12 | 2003-05-23 | Ntt Docomo Inc | Document converting system, document converting method and computer-readable recording medium with document converting program recorded thereon |
US20040205568A1 (en) * | 2002-03-01 | 2004-10-14 | Breuel Thomas M. | Method and system for document image layout deconstruction and redisplay system |
US6907431B2 (en) * | 2002-05-03 | 2005-06-14 | Hewlett-Packard Development Company, L.P. | Method for determining a logical structure of a document |
JP2004178010A (en) * | 2002-11-22 | 2004-06-24 | Toshiba Corp | Document processor, its method, and program |
US7310773B2 (en) * | 2003-01-13 | 2007-12-18 | Hewlett-Packard Development Company, L.P. | Removal of extraneous text from electronic documents |
US7165216B2 (en) * | 2004-01-14 | 2007-01-16 | Xerox Corporation | Systems and methods for converting legacy and proprietary documents into extended mark-up language format |
US8495061B1 (en) * | 2004-09-29 | 2013-07-23 | Google Inc. | Automatic metadata identification |
US7440967B2 (en) * | 2004-11-10 | 2008-10-21 | Xerox Corporation | System and method for transforming legacy documents into XML documents |
US7937653B2 (en) | 2005-01-10 | 2011-05-03 | Xerox Corporation | Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents |
US7693848B2 (en) | 2005-01-10 | 2010-04-06 | Xerox Corporation | Method and apparatus for structuring documents based on layout, content and collection |
US8706475B2 (en) * | 2005-01-10 | 2014-04-22 | Xerox Corporation | Method and apparatus for detecting a table of contents and reference determination |
JP4789516B2 (en) * | 2005-06-14 | 2011-10-12 | キヤノン株式会社 | Document conversion apparatus, document conversion method, and storage medium |
- 2005-01-10: US application US11/032,817 (US7937653B2), Expired - Fee Related
- 2006-01-05: JP application JP2006000315A (JP4974529B2), Expired - Fee Related
- 2006-01-10: EP application EP20060100200 (EP1679613A3), Ceased
- 2011-02-23: US application US13/032,996 (US9218326B2), Active
Non-Patent Citations (6)
Title |
---|
ANONYMOUS: "Document Recognition and Retrieval X (Proceedings of SPIE/IS&T): Header and Footer Extraction by Page-Association from Xiaofan Lin", IMAGING ONLINE PUBLICATIONS CATALOG, vol. 5010, XP002543908, Retrieved from the Internet <URL:http://www.imaging.org/ScriptContent/store/physpub.cfm?seriesid=24&pubid=571> [retrieved on 20090623] * |
HERVÉ DÉJEAN, JEAN-LUC MEUNIER: "Versatile Page Numbering Analysis", XEROX PUBLICATIONS, XP002533713, Retrieved from the Internet <URL:http://www.xrce.xerox.com/Publications/Attachments/2007-031/2007-031.pdf> [retrieved on 20090624] * |
XIAOFAN LIN: "Header and Footer Extraction by Page-Association", 2002, pages 1 - 8, XP002533579, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.6211> [retrieved on 20090623] * |
XIAOFAN LIN: "Header and Footer Extraction by Page-Association", CITESEERX ABSTRACT, 2002, XP002543906, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.6211> [retrieved on 20090623] * |
XIAOFAN LIN: "Header and Footer Extraction by Page-Association", DOCUMENT RECOGNITION AND RETRIEVAL X- DRR 2003: SANTA CLARA, CALIFORNIA, USA (LAYOUT ANALYSIS), Universität Trier, pages 1 - 3, XP002543907, Retrieved from the Internet <URL:http://www.informatik.uni-trier.de/~ley/db/conf/drr/drr2003.html#Lin03> [retrieved on 20090623] * |
XIAOFAN LIN: "Header and Footer Extraction by Page-Association", HP LABS TECHNICAL REPORTS, 26 December 2002 (2002-12-26), XP002533592, Retrieved from the Internet <URL:http://web.archive.org/web/20021226015423/http://www.hpl.hp.com/techreports/2002/HPL-2002-129.html> [retrieved on 20090623] * |
Also Published As
Publication number | Publication date |
---|---|
US7937653B2 (en) | 2011-05-03 |
JP2006195980A (en) | 2006-07-27 |
US20110145701A1 (en) | 2011-06-16 |
US20060156226A1 (en) | 2006-07-13 |
EP1679613A2 (en) | 2006-07-12 |
JP4974529B2 (en) | 2012-07-11 |
US9218326B2 (en) | 2015-12-22 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
EP1679613A3 (en) | Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents | |
EP1679623A3 (en) | Method and apparatus for detecting a table of contents and reference determination | |
EP1669896A3 (en) | A machine learning system for extracting structured records from web pages and other text sources | |
EP1615154A3 (en) | Method and software for extracting chemical data | |
Johnston | BSL, Auslan and NZSL: three signed languages or one? | |
EP1589443A3 (en) | Method, system or memory storing a computer program for document processing | |
WO2007038389A3 (en) | Method and apparatus for identifying and classifying network documents as spam | |
EP1679625A3 (en) | Method and apparatus for structuring documents based on layout, content and collection | |
EP1635268A3 (en) | Freeform digital ink annotation recognition | |
WO2007130544A3 (en) | Method for domain identification of documents in a document database | |
GB2448275A (en) | Document analysis system for integration of paper records into a searchable electronic database | |
EP1962208A3 (en) | System and method for searching annotated document collections | |
WO2009098468A3 (en) | A method and system of indexing numerical data | |
WO2008100849A3 (en) | Semantics-based method and system for document analysis | |
EP2444920A3 (en) | Detection of duplicate document content using two-dimensional visual fingerprinting | |
EP1736901A3 (en) | Method for classifying sub-trees in semi-structured documents | |
EP2634709A3 (en) | System and method for appending security information to search engine results | |
EP1986160A3 (en) | Document processing system control using document feature analysis for identification | |
WO2006132793A3 (en) | Learning facts from semi-structured text | |
WO2007059232A3 (en) | Methods and apparatus for probe-based clustering | |
EP1909194A4 (en) | Information processing device, feature extraction method, recording medium, and program | |
EP1233349A3 (en) | Data display method and apparatus for use in text mining | |
EP1739573A3 (en) | Probabilistic learning method for XML annotation of documents | |
JP2002245070A5 (en) | ||
EP1965312A3 (en) | Information processing apparatus and method, program, and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
AK | Designated contracting states | Kind code of ref document: A2 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR |
AX | Request for extension of the european patent | Extension state: AL BA HR MK YU |
PUAL | Search report despatched | Free format text: ORIGINAL CODE: 0009013 |
AK | Designated contracting states | Kind code of ref document: A3 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR |
AX | Request for extension of the european patent | Extension state: AL BA HR MK YU |
17P | Request for examination filed | Effective date: 20100414 |
AKX | Designation fees paid | Designated state(s): DE FR GB |
17Q | First examination report despatched | Effective date: 20100601 |
APBK | Appeal reference recorded | Free format text: ORIGINAL CODE: EPIDOSNREFNE |
APBN | Date of receipt of notice of appeal recorded | Free format text: ORIGINAL CODE: EPIDOSNNOA2E |
APBR | Date of receipt of statement of grounds of appeal recorded | Free format text: ORIGINAL CODE: EPIDOSNNOA3E |
APAF | Appeal reference modified | Free format text: ORIGINAL CODE: EPIDOSCREFNE |
RAP1 | Party data changed (applicant data changed or rights of an application transferred) | Owner name: XEROX CORPORATION |
APAF | Appeal reference modified | Free format text: ORIGINAL CODE: EPIDOSCREFNE |
APBT | Appeal procedure closed | Free format text: ORIGINAL CODE: EPIDOSNNOA9E |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED |
18R | Application refused | Effective date: 20161005 |