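CN103164388A - Method and apparatus for obtaining structured information in a fixed-layout document - Google Patents
Method and apparatus for obtaining structured information in a fixed-layout document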
- Publication number
- CN103164388A
- Authority
- CN
- China
- Prior art keywords
- character
- block structure
- information
- structure character
- article content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/143—Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
- G06F16/152—File search processing using file content signatures, e.g. hash values
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2468—Fuzzy queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/131—Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Fuzzy Systems (AREA)
- Algebra (AREA)
- Probability & Statistics with Applications (AREA)
- Automation & Control Theory (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
Abstract
Description
Claims (10)
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110409463.XA CN103164388B (zh) | 2011-12-09 | 2011-12-09 | Method and apparatus for obtaining structured information in a fixed-layout document |
US14/119,109 US9773009B2 (en) | 2011-12-09 | 2012-12-07 | Methods and apparatus for obtaining structured information in fixed layout documents |
KR20137030609A KR20140053888A (ko) | 2011-12-09 | 2012-12-07 | Method and apparatus for obtaining structured information in a fixed-layout file |
PCT/CN2012/086137 WO2013083067A1 (zh) | 2011-12-09 | 2012-12-07 | Method and apparatus for obtaining structured information in a fixed-layout document |
EP12855138.9A EP2790111A4 (en) | 2011-12-09 | 2012-12-07 | METHOD AND DEVICE FOR ACQUIRING STRUCTURED INFORMATION IN A LAYOUT FILE |
JP2014520525A JP5930496B2 (ja) | 2011-12-09 | 2012-12-07 | Method and apparatus for obtaining structured information in a layout file |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110409463.XA CN103164388B (zh) | 2011-12-09 | 2011-12-09 | Method and apparatus for obtaining structured information in a fixed-layout document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103164388A (zh) | 2013-06-19 |
CN103164388B (zh) | 2016-07-06 |
Family
ID=48573563
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110409463.XA Active CN103164388B (zh) | Method and apparatus for obtaining structured information in a fixed-layout document | 2011-12-09 | 2011-12-09 |
Country Status (6)
Country | Link |
---|---|
US (1) | US9773009B2 (zh) |
EP (1) | EP2790111A4 (zh) |
JP (1) | JP5930496B2 (zh) |
KR (1) | KR20140053888A (zh) |
CN (1) | CN103164388B (zh) |
WO (1) | WO2013083067A1 (zh) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
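CN104536948A (zh) * | 2014-12-10 | 2015-04-22 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for processing fixed-layout documents |
CN109684980A (zh) * | 2018-09-19 | 2019-04-26 | Tencent Technology (Shenzhen) Co., Ltd. | Automatic examination paper marking method and apparatus |
CN110287465A (zh) * | 2019-06-22 | 2019-09-27 | Guangzhou Shiyuan Electronic Technology Co., Ltd. | Text processing method, apparatus, device, and storage medium |
CN110705503A (zh) * | 2019-10-14 | 2020-01-17 | Beijing Information Science and Technology University | Method and apparatus for generating structured table-of-contents information |
CN111046064A (zh) * | 2019-12-23 | 2020-04-21 | Zhangyue Technology Co., Ltd. | Method for obtaining book copyright information, electronic device, and computer storage medium |
CN111414741A (zh) * | 2018-12-19 | 2020-07-14 | Peking University Founder Group Co., Ltd. | Method, apparatus, device, and medium for creating layout templates for publications |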
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
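CN104346322B (zh) * | 2013-08-08 | 2018-07-10 | Peking University Founder Group Co., Ltd. | Document format processing apparatus and document format processing method |
CN107330077B (zh) * | 2017-07-01 | 2020-07-14 | Information Center of Guangdong Power Grid Co., Ltd. | Method for retrieving archives in a digital archive |
CN111176640B (zh) * | 2018-11-13 | 2022-05-13 | Wuhan Douyu Network Technology Co., Ltd. | Method, storage medium, device, and system for displaying the layout hierarchy in an Android project |
CN110196670A (zh) * | 2019-05-31 | 2019-09-03 | Shukun (Beijing) Network Technology Co., Ltd. | Text generation method, device, and computer-readable storage medium |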
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
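CN101354727A (zh) * | 2008-09-24 | 2009-01-28 | Peking University | Method and apparatus for establishing links between the table of contents and the body text of a digital document |
CN101739391A (zh) * | 2009-12-16 | 2010-06-16 | Peng Yang | Method for generating an e-book in a binary file format, and the e-book so generated |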
Family Cites Families (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW490643B (en) * | 1996-05-21 | 2002-06-11 | Hitachi Ltd | Estimated recognition device for input character string |
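JPH11232439A (ja) * | 1998-02-16 | 1999-08-27 | Toshinari Hayashi | Document image structure analysis method |
JP2001052116A (ja) * | 1999-08-06 | 2001-02-23 | Toshiba Corp | Pattern sequence matching apparatus, pattern sequence matching method, character string matching apparatus, and character string matching method |
JP2001265762A (ja) * | 2000-03-21 | 2001-09-28 | Matsushita Electric Ind Co Ltd | Document structure extraction apparatus and document structure information extraction method |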
EP1770547B1 (en) * | 2001-06-14 | 2012-09-12 | Sharp Kabushiki Kaisha | Data processing method, data processing program, and data processing apparatus |
JP2003288334A (ja) * | 2002-03-28 | 2003-10-10 | Toshiba Corp | Document processing apparatus and document processing method |
US7142728B2 (en) * | 2002-05-17 | 2006-11-28 | Science Applications International Corporation | Method and system for extracting information from a document |
US7240047B2 (en) * | 2002-12-23 | 2007-07-03 | Hewlett-Packard Development Company, L.P. | Apparatus and method for market-based document layout selection |
US7383500B2 (en) * | 2004-04-30 | 2008-06-03 | Microsoft Corporation | Methods and systems for building packages that contain pre-paginated documents |
JP2006163651A (ja) * | 2004-12-03 | 2006-06-22 | Sony Computer Entertainment Inc | Display device, display device control method, program, and font data |
US7721198B2 (en) * | 2006-01-31 | 2010-05-18 | Microsoft Corporation | Story tracking for fixed layout markup documents |
US7676741B2 (en) * | 2006-01-31 | 2010-03-09 | Microsoft Corporation | Structural context for fixed layout markup documents |
US7917493B2 (en) * | 2007-04-19 | 2011-03-29 | Retrevo Inc. | Indexing and searching product identifiers |
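CN101571859B (zh) | 2008-04-28 | 2013-01-02 | International Business Machines Corporation | Method and apparatus for annotating documents |
CN101458680B (zh) | 2008-09-03 | 2010-12-01 | Peking University | Method and apparatus for automatically recognizing the table of contents of a digital document |
JP2010157107A (ja) * | 2008-12-26 | 2010-07-15 | Hitachi Software Eng Co Ltd | Business document processing apparatus |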
US8254681B1 (en) * | 2009-02-05 | 2012-08-28 | Google Inc. | Display of document image optimized for reading |
WO2011036830A1 (ja) * | 2009-09-24 | 2011-03-31 | NEC Corporation | Word recognition device, method, non-transitory computer-readable medium storing a program, and shipment sorting device |
WO2012057891A1 (en) * | 2010-10-26 | 2012-05-03 | Hewlett-Packard Development Company, L.P. | Transformation of a document into interactive media content |
US8645819B2 (en) * | 2011-06-17 | 2014-02-04 | Xerox Corporation | Detection and extraction of elements constituting images in unstructured document files |
2011
- 2011-12-09: CN, CN201110409463.XA (CN103164388B), Active
2012
- 2012-12-07: WO, PCT/CN2012/086137 (WO2013083067A1), Application Filing
- 2012-12-07: US, US14/119,109 (US9773009B2), Active
- 2012-12-07: KR, KR20137030609A (KR20140053888A), Application Discontinuation
- 2012-12-07: JP, JP2014520525A (JP5930496B2), Expired - Fee Related
- 2012-12-07: EP, EP12855138.9A (EP2790111A4), Withdrawn
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
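CN104536948A (zh) * | 2014-12-10 | 2015-04-22 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for processing fixed-layout documents |
CN109684980A (zh) * | 2018-09-19 | 2019-04-26 | Tencent Technology (Shenzhen) Co., Ltd. | Automatic examination paper marking method and apparatus |
CN109684980B (zh) * | 2018-09-19 | 2022-12-13 | Tencent Technology (Shenzhen) Co., Ltd. | Automatic examination paper marking method and apparatus |
CN111414741A (zh) * | 2018-12-19 | 2020-07-14 | Peking University Founder Group Co., Ltd. | Method, apparatus, device, and medium for creating layout templates for publications |
CN111414741B (zh) * | 2018-12-19 | 2022-06-14 | Peking University Founder Group Co., Ltd. | Method, apparatus, device, and medium for creating layout templates for publications |
CN110287465A (zh) * | 2019-06-22 | 2019-09-27 | Guangzhou Shiyuan Electronic Technology Co., Ltd. | Text processing method, apparatus, device, and storage medium |
CN110705503A (zh) * | 2019-10-14 | 2020-01-17 | Beijing Information Science and Technology University | Method and apparatus for generating structured table-of-contents information |
CN110705503B (zh) * | 2019-10-14 | 2022-02-25 | Beijing Information Science and Technology University | Method and apparatus for generating structured table-of-contents information |
CN111046064A (zh) * | 2019-12-23 | 2020-04-21 | Zhangyue Technology Co., Ltd. | Method for obtaining book copyright information, electronic device, and computer storage medium |
CN111046064B (zh) * | 2019-12-23 | 2023-05-19 | Zhangyue Technology Co., Ltd. | Method for obtaining book copyright information, electronic device, and computer storage medium |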
Also Published As
Publication number | Publication date |
---|---|
CN103164388B (zh) | 2016-07-06 |
KR20140053888A (ko) | 2014-05-08 |
JP2014527660A (ja) | 2014-10-16 |
EP2790111A1 (en) | 2014-10-15 |
US20140289274A1 (en) | 2014-09-25 |
WO2013083067A1 (zh) | 2013-06-13 |
US9773009B2 (en) | 2017-09-26 |
EP2790111A4 (en) | 2015-12-09 |
JP5930496B2 (ja) | 2016-06-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
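CN103164388A (zh) | | Method and apparatus for obtaining structured information in a fixed-layout document |
CN107729480B (zh) | | Method and apparatus for extracting text information from a defined region |
CN104598577B (zh) | | Method for extracting the main text of a web page |
CN103123618B (zh) | | Method and apparatus for obtaining text similarity |
CN101706807B (zh) | | Method for automatically acquiring new words from Chinese web pages |
KR102157202B1 (ko) | | Information mining method, system, electronic device, and readable storage medium |
CN102831131B (zh) | | Method and apparatus for building an annotated web page corpus |
CN102207946B (zh) | | Semi-automatic method for generating a knowledge network |
CN103699585A (zh) | | Method, apparatus, and system for file metadata storage and file recovery |
CN103324609A (zh) | | Text proofreading apparatus and text proofreading method |
CN102945244A (zh) | | Method for detecting and filtering duplicate Chinese web documents based on full-stop feature strings |
CN102955773B (zh) | | Method and system for identifying chemical names in Chinese documents |
CN102651002A (zh) | | Web page information extraction method and system |
CN105589894B (zh) | | Document index building method and apparatus, and document retrieval method and apparatus |
CN103324622A (zh) | | Method and apparatus for automatically generating a front-page summary |
CN102200968A (zh) | | Method and apparatus for deduplicating Excel spreadsheet data |
CN110909168A (zh) | | Knowledge graph updating method and apparatus, storage medium, and electronic device |
CN105488471A (zh) | | Glyph recognition method and apparatus |
CN102663108A (zh) | | Drug community discovery method based on a parallelized label propagation algorithm over a complex network model |
CN103927176A (zh) | | Method for generating a program feature tree based on a hierarchical topic model |
CN100562872C (zh) | | Automatic template information locating method for structured web pages |
CN113836272A (zh) | | Key information display method, system, computer device, and readable storage medium |
CN102662953B (zh) | | Semantic annotation system and method integrated with an input method |
CN103810213A (zh) | | Search method and system |
CN105608137A (zh) | | Method and apparatus for extracting identity identifiers |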
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | ASS | Succession or assignment of patent right | Owner name: FOUNDER INFORMATION INDUSTRY HOLDING CO., LTD. BEI. Free format text: FORMER OWNER: BEIJING FOUNDER APABI TECHNOLOGY CO., LTD. Effective date: 20130902 |
| | C41 | Transfer of patent application or patent right or utility model | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20130902. Address after: 9th floor, Founder Building, No. 298 Chengfu Road, Haidian District, Beijing 100871. Applicant after: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.; FOUNDER INFORMATION INDUSTRY HOLDINGS Co.,Ltd.; FOUNDER APABI TECHNOLOGY Ltd. Address before: 9th floor, Founder Building, No. 298 Chengfu Road, Haidian District, Beijing 100871. Applicant before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.; FOUNDER APABI TECHNOLOGY Ltd. |
| | C14 | Grant of patent or utility model | |
| | GR01 | Patent grant | |
| | CP01 | Change in the name or title of a patent holder | |
| | CP01 | Change in the name or title of a patent holder | Address after: 9th floor, Founder Building, No. 298 Chengfu Road, Haidian District, Beijing 100871. Patentee after: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.; PKU FOUNDER INFORMATION INDUSTRY GROUP CO.,LTD.; FOUNDER APABI TECHNOLOGY Ltd. Address before: 9th floor, Founder Building, No. 298 Chengfu Road, Haidian District, Beijing 100871. Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.; FOUNDER INFORMATION INDUSTRY HOLDINGS Co.,Ltd.; FOUNDER APABI TECHNOLOGY Ltd. |
| | TR01 | Transfer of patent right | Effective date of registration: 20220921. Address after: 3007, Hengqin International Financial Center Building, No. 58 Huajin Street, Hengqin New Area, Zhuhai, Guangdong 519031. Patentee after: New founder holdings development Co.,Ltd.; FOUNDER APABI TECHNOLOGY Ltd. Address before: 9th floor, Founder Building, No. 298 Chengfu Road, Haidian District, Beijing 100871. Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.; PKU FOUNDER INFORMATION INDUSTRY GROUP CO.,LTD.; FOUNDER APABI TECHNOLOGY Ltd. |
| | TR01 | Transfer of patent right | |