CN103176956A - Method and device for extracting document structure - Google Patents
Method and device for extracting document structure
- Publication number
- CN103176956A (application CN201110438858.2A)
- Authority
- CN
- China
- Prior art keywords
- page
- row
- list
- entry
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/10—Text processing > G06F40/103—Formatting, i.e. changing of presentation of documents
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/10—Text processing
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06F—ELECTRIC DIGITAL DATA PROCESSING > G06F40/00—Handling natural language data > G06F40/20—Natural language analysis > G06F40/279—Recognition of textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
Abstract
Description
Claims (10)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110438858.2A CN103176956B (zh) | 2011-12-21 | 2011-12-21 | Method and device for extracting document structure |
US13/725,879 US9418051B2 (en) | 2011-12-21 | 2012-12-21 | Methods and devices for extracting document structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110438858.2A CN103176956B (zh) | 2011-12-21 | 2011-12-21 | Method and device for extracting document structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103176956A (zh) | 2013-06-26 |
CN103176956B (zh) | 2016-08-03 |
Family
ID=48636842
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201110438858.2A CN103176956B (zh) | Active | 2011-12-21 | 2011-12-21 | Method and device for extracting document structure |
Country Status (2)
Country | Link |
---|---|
US (1) | US9418051B2 (zh) |
CN (1) | CN103176956B (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
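CN107633040A (zh) * | 2017-09-13 | 2018-01-26 | 张贝贝 | Method for splitting PDF files by topics related to major restructuring |
CN107633039A (zh) * | 2017-09-13 | 2018-01-26 | 张贝贝 | Method for splitting PDF files by topics related to equity transfer |
CN108446264A (zh) * | 2018-03-26 | 2018-08-24 | 阿博茨德(北京)科技有限公司 | Method and device for vector parsing of tables in PDF documents |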
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10204095B1 (en) * | 2015-02-10 | 2019-02-12 | West Corporation | Processing and delivery of private electronic documents |
CN111767254B (zh) * | 2020-07-07 | 2021-01-05 | 江苏中威科技软件系统有限公司 | Multi-file reading device and method based on fixed-layout data stream file technology |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080320579A1 (en) * | 2007-06-21 | 2008-12-25 | Thomson Corporation | Method and system for validating references |
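CN101354727A (zh) * | 2008-09-24 | 2009-01-28 | 北京大学 | Method and device for establishing links between the table of contents and the body text of a digital document |
CN101937462A (zh) * | 2010-09-03 | 2011-01-05 | 中国科学院声学研究所 | Method and system for automatic document evaluation |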
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7149957B2 (en) * | 2001-11-19 | 2006-12-12 | Ricoh Company, Ltd. | Techniques for retrieving multimedia information using a paper-based interface |
US7142728B2 (en) * | 2002-05-17 | 2006-11-28 | Science Applications International Corporation | Method and system for extracting information from a document |
US20060242180A1 (en) * | 2003-07-23 | 2006-10-26 | Graf James A | Extracting data from semi-structured text documents |
JP4314221B2 (ja) * | 2005-07-28 | 2009-08-12 | 株式会社東芝 | Structured document storage device, structured document search device, structured document system, method, and program |
US20070124319A1 (en) * | 2005-11-28 | 2007-05-31 | Microsoft Corporation | Metadata generation for rich media |
JP5248845B2 (ja) * | 2006-12-13 | 2013-07-31 | キヤノン株式会社 | Document processing apparatus, document processing method, program, and storage medium |
WO2009047570A1 (en) * | 2007-10-10 | 2009-04-16 | Iti Scotland Limited | Information extraction apparatus and methods |
US9002100B2 (en) * | 2008-04-02 | 2015-04-07 | Xerox Corporation | Model uncertainty visualization for active learning |
US8655803B2 (en) * | 2008-12-17 | 2014-02-18 | Xerox Corporation | Method of feature extraction from noisy documents |
US20130205202A1 (en) * | 2010-10-26 | 2013-08-08 | Jun Xiao | Transformation of a Document into Interactive Media Content |
- 2011-12-21: CN application CN201110438858.2A, patent CN103176956B (zh), status Active
- 2012-12-21: US application US13/725,879, patent US9418051B2 (en), status Active
Non-Patent Citations (3)
Title |
---|
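李朝光 et al.: "Automatic extraction of paper metadata information", 《计算机工程与应用》 * |
郭志鑫 et al.: "Semantics-based extraction of document reference metadata in SemreX", 《计算机研究与发展》 * |
陈路瑶 et al.: "Extraction and logical description of trust patterns in information document structure", 《计算机应用研究》 * |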
Also Published As
Publication number | Publication date |
---|---|
US20130167018A1 (en) | 2013-06-27 |
CN103176956B (zh) | 2016-08-03 |
US9418051B2 (en) | 2016-08-16 |
Similar Documents
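Publication | Title | Publication Date |
---|---|---|
US10592184B2 (en) | Method and device for parsing tables in PDF document | |
US9798925B2 (en) | Method for identifying PDF document | |
CN101770446B (zh) | Method and system for recognizing tables in fixed-layout files | |
CN111325110A (zh) | OCR-based table layout restoration method, device, and storage medium | |
CN101206639A (zh) | PDF-based indexing method for complex page layouts | |
US8452132B2 (en) | Automatic file name generation in OCR systems | |
CN110704570A (zh) | Method for extracting structured information from continuous-page fixed-layout documents | |
US20150095769A1 (en) | Layout Analysis Method And System | |
CN106250830A (zh) | Structured analysis and processing method for digital books | |
CN102270206A (zh) | Method and device for capturing valid web page content | |
EP2544099A1 (en) | Method for creating an enrichment file associated with a page of an electronic document | |
CN103176956B (zh) | Method and device for extracting document structure | |
CN110633660B (zh) | Document recognition method, device, and storage medium | |
CN106294304B (zh) | Method for automatically recognizing footnotes in fixed-layout documents and converting them into reflowable-document annotations | |
CN105159877A (zh) | Cross-media automatic typesetting system and method | |
CN104951429A (zh) | Method and device for recognizing headers and footers in fixed-layout electronic documents | |
CN105630817A (zh) | Method and system for parsing electronic invoice content | |
WO2019041442A1 (zh) | Method and system for structured extraction of chart data, electronic device, and computer-readable storage medium | |
CN100552670C (zh) | Method for automatically identifying the type area of a digital document | |
JP5380040B2 (ja) | Document processing device | |
CN104978577B (zh) | Information processing method, device, and electronic equipment | |
CN105488471A (zh) | Glyph recognition method and device | |
JP5950700B2 (ja) | Image processing apparatus, image processing method, and program | |
JP2008108114A (ja) | Document processing device and document processing method | |
CN104536947A (zh) | Method and device for processing fixed-layout documents | |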
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
ASS | Succession or assignment of patent right | Owner name: FOUNDER INFORMATION INDUSTRY HOLDING CO., LTD. BEI; Free format text: FORMER OWNER: BEIJING FOUNDER APABI TECHNOLOGY CO., LTD.; Effective date: 20130912 |
C41 | Transfer of patent application or patent right or utility model | |
TA01 | Transfer of patent application right | Effective date of registration: 20130912; Address after: 100871, Beijing, Haidian District, Cheng Fu Road 298, Founder Building, 5th floor; Applicant after: PEKING UNIVERSITY FOUNDER GROUP Co., Ltd.; Applicant after: FOUNDER INFORMATION INDUSTRY HOLDINGS Co., Ltd.; Applicant after: FOUNDER APABI TECHNOLOGY Ltd.; Address before: 100871, Beijing, Haidian District, Cheng Fu Road 298, Founder Building, 5th floor; Applicant before: PEKING UNIVERSITY FOUNDER GROUP Co., Ltd.; Applicant before: FOUNDER APABI TECHNOLOGY Ltd. |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CP03 | Change of name, title or address | |
CP03 | Change of name, title or address | Address after: 100871, Beijing, Haidian District, Cheng Fu Road 298, Founder Building, 9th floor; Patentee after: PEKING UNIVERSITY FOUNDER GROUP Co., Ltd.; Patentee after: PKU FOUNDER INFORMATION INDUSTRY GROUP CO., LTD.; Patentee after: FOUNDER APABI TECHNOLOGY Ltd.; Address before: 100871, Beijing, Haidian District, Cheng Fu Road 298, Founder Building, 5th floor; Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co., Ltd.; Patentee before: FOUNDER INFORMATION INDUSTRY HOLDINGS Co., Ltd.; Patentee before: FOUNDER APABI TECHNOLOGY Ltd. |
TR01 | Transfer of patent right | Effective date of registration: 20220919; Address after: 3007, Hengqin International Financial Center Building, No. 58, Huajin Street, Hengqin New Area, Zhuhai, Guangdong 519031; Patentee after: New Founder Holdings Development Co., Ltd.; Patentee after: FOUNDER APABI TECHNOLOGY Ltd.; Address before: 100871, Beijing, Haidian District, Cheng Fu Road 298, Founder Building, 9th floor; Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co., Ltd.; Patentee before: PKU FOUNDER INFORMATION INDUSTRY GROUP CO., LTD.; Patentee before: FOUNDER APABI TECHNOLOGY Ltd. |
TR01 | Transfer of patent right | |