CN109492211A - Table extraction method based on OFD documents - Google Patents

Table extraction method based on OFD documents

Info

Publication number
CN109492211A
Authority
CN
China
Prior art keywords
data
list
module
management module
objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811343405.XA
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Jinge Polytron Technologies Inc
Original Assignee
Jiangxi Jinge Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Jinge Polytron Technologies Inc
Priority to CN201811343405.XA
Publication of CN109492211A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines

Abstract

The invention discloses a table extraction method based on OFD documents, comprising a data parsing flow and a data application flow. The application layer and the logic layer are separated so that the two parts do not affect each other, which solves the problem that tables cannot be extracted from OFD documents because of document format and application environment factors, and thereby realizes table extraction from OFD documents. The architecture is clear, easy to understand, and easy to implement; it simplifies debugging and later maintenance and improves the extensibility of the extraction technique. The method relies entirely on the standard format of OFD documents and needs almost no supporting software beyond the method itself, which reduces cost. Tables can be extracted losslessly, and related information such as the text inside a table can be extracted and edited.

Description

Table extraction method based on OFD documents
Technical field
The present invention relates to techniques for processing electronic document formats, and in particular to a table extraction method based on OFD documents.
Background
A fixed-layout document format is a format that fixes the page layout of digital content objects such as text, graphics, and images according to certain rules and presents them in that fixed form.
OFD (Open Fixed-layout Document) is a fixed-layout document format developed independently in China; it is independent of software, hardware, operating system, and output device.
As national demand for OFD documents grows and their adoption is promoted, all industries now use OFD documents more and more frequently. Besides reading document content, current OFD documents support functions such as annotation, bookmark editing, applying electronic seals, and editing attachments; government bodies and public institutions add written instructions and seals to OFD documents especially often.
Because of how OFD documents are defined and the limits of the format, they are designed for reading and printing, so interaction with the reader is comparatively weak; this includes the extraction of content such as images, tables, and text. Current reader software can easily extract only the text content of an OFD document; when extracting images or tables, it suffers from low efficiency, low accuracy, or cannot perform the extraction at all. A table can currently be captured by taking a screenshot, but a screenshot depends on the DPI of the hardware device, the picture is easily distorted, and the captured table content can be neither selected nor edited.
Summary of the invention
To solve the above technical problem, the present invention provides a table extraction method based on OFD documents. With the OFD document standard specification as its core, it extracts tables from an OFD document by parsing the document content. This strengthens the interaction between OFD documents and other applications, promotes the use of OFD documents, and improves working efficiency.
The above object is achieved by the following technical solution. A table extraction method based on OFD documents comprises an application interface module, a data management module, a data extraction module, and a data parsing module, characterized in that:
The application interface module provides simple, direct interface functions for upper-layer applications to call, and is responsible for calling the data management module to implement those interface functions;
The data management module schedules the data extraction module and the data parsing module, aggregates the data content they produce, and delivers the data to the application interface module for use;
The data extraction module parses the OFD document, extracts all data from it, classifies the data into table data objects and non-table data objects, and hands the classified data to the data management module for unified distribution and management;
The data parsing module obtains the table data objects from the data management module, processes all of them with a table border lookup algorithm to obtain table objects, assembles the table objects into a table list, and hands the table list to the data management module.
Further, table data is the basic element that makes up a table and can be understood as a line segment. It contains the X and Y coordinates of the segment's start point, the X and Y coordinates of its end point, and whether the segment is a dashed line.
Further, a table data object is a set of table data.
Further, a non-table data object is non-segment data content such as text, pictures, and annotations.
Further, a table object is a complete table composed of table object data.
Further, the table list is a list made up of table objects.
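
The definitions above map naturally onto a few record types. The following Python sketch is illustrative only: the class and field names (TableData, NonTableData, TableObject, bounds) are assumptions made for this example and do not come from the patent text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class TableData:
    """One line segment, the basic element of a table."""
    x1: float  # start point X
    y1: float  # start point Y
    x2: float  # end point X
    y2: float  # end point Y
    dashed: bool = False  # whether the segment is a dashed line


# A table data object is a set of table data (here: a list of segments).
TableDataObject = List[TableData]


@dataclass(frozen=True)
class NonTableData:
    """Non-segment content such as text, a picture, or an annotation."""
    kind: str  # assumed labels: "text", "picture", "annotation"
    x: float
    y: float
    content: str = ""


@dataclass
class TableObject:
    """A complete table: its segments plus the content inside it."""
    segments: List[TableData] = field(default_factory=list)
    contents: List[NonTableData] = field(default_factory=list)

    def bounds(self) -> Tuple[float, float, float, float]:
        """Return (x_min, y_min, x_max, y_max) over all segments."""
        xs = [v for s in self.segments for v in (s.x1, s.x2)]
        ys = [v for s in self.segments for v in (s.y1, s.y2)]
        return min(xs), min(ys), max(xs), max(ys)


# The table list is simply a list of table objects.
TableList = List[TableObject]
```

Freezing the segment and content records keeps them hashable and safe to share between lists, which is a design choice of this sketch rather than a requirement of the method.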
Further, the detailed procedure of the data parsing module is as follows:
1) Traverse all table data objects and take the one whose vertex has the minimum X and Y coordinates; if two or more identical data objects exist, take the first one found; define this table data object as the baseline;
2) Store the baseline in a temporary table object;
3) Traverse all table data objects, take every table data object whose vertex coordinates intersect the baseline, and store the objects taken in the temporary table object from step 2);
4) Define the first table data object found in step 3) as the new baseline and repeat step 3) until all table data objects have been traversed;
5) Take the four vertex coordinates of the temporary table object;
6) Traverse all non-table data objects, take every one whose coordinates lie within the range of the four vertex coordinates from step 5), and save it in the temporary table object;
7) Create a table object list and store the temporary table object in it;
8) Hand the table object list to the data management module to be saved.
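
Steps 1) to 6) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: every table data object is simplified to one axis-aligned segment, "intersects" is tested by bounding-box overlap with a tolerance EPS, and a queue lets every newly found segment serve as baseline in turn (one reading of step 4). All names (Segment, Item, Table, touches, find_table) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

EPS = 1e-6  # assumed tolerance for coordinate comparison


@dataclass(frozen=True)
class Segment:
    x1: float
    y1: float
    x2: float
    y2: float

    def box(self) -> Tuple[float, float, float, float]:
        """Bounding box as (x_min, y_min, x_max, y_max)."""
        return (min(self.x1, self.x2), min(self.y1, self.y2),
                max(self.x1, self.x2), max(self.y1, self.y2))


@dataclass(frozen=True)
class Item:
    """A non-table data object (for example a piece of text)."""
    x: float
    y: float
    text: str


@dataclass
class Table:
    segments: List[Segment] = field(default_factory=list)
    items: List[Item] = field(default_factory=list)


def touches(a: Segment, b: Segment) -> bool:
    """For axis-aligned segments, overlapping boxes mean a shared point."""
    ax1, ay1, ax2, ay2 = a.box()
    bx1, by1, bx2, by2 = b.box()
    return (ax1 <= bx2 + EPS and bx1 <= ax2 + EPS and
            ay1 <= by2 + EPS and by1 <= ay2 + EPS)


def find_table(segments: List[Segment], items: List[Item]) -> Optional[Table]:
    if not segments:
        return None
    # Step 1: baseline = segment with the smallest (x, y) corner;
    # min() returns the first candidate on ties.
    baseline = min(segments, key=lambda s: (s.box()[0], s.box()[1]))
    table = Table(segments=[baseline])  # step 2
    remaining = [s for s in segments if s is not baseline]
    pending = [baseline]
    while pending:  # steps 3 and 4
        current = pending.pop(0)
        found = [s for s in remaining if touches(current, s)]
        remaining = [s for s in remaining if not touches(current, s)]
        table.segments.extend(found)
        pending.extend(found)
    # Step 5: the four corners follow from the overall bounding box.
    boxes = [s.box() for s in table.segments]
    x_min = min(b[0] for b in boxes)
    y_min = min(b[1] for b in boxes)
    x_max = max(b[2] for b in boxes)
    y_max = max(b[3] for b in boxes)
    # Step 6: keep the non-table objects that lie inside the box.
    table.items = [i for i in items
                   if x_min <= i.x <= x_max and y_min <= i.y <= y_max]
    return table
```

Steps 7) and 8) then amount to wrapping the result in a list, for example `table_list = [find_table(segments, items)]`, and handing that list to whatever component stores it.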
The table extraction method based on OFD documents further comprises a data parsing flow and a data application flow.
The data parsing flow is as follows:
1) Start the application interface module: create the application interface module object; the application uses the external interface functions that the application interface module provides;
2) Start the data management module: create the data management module object; the data management module starts working;
3) Start the data extraction module: create the data extraction module object; the data management module assigns its work;
4) Start the data parsing module: create the data parsing module object; the data management module assigns its work;
The data application flow is as follows:
1) Call the application interface module: the application calls a function of the application interface module and passes in parameters to obtain table data;
2) Call the data management module: the application interface module calls the data management module, which uses the parameters passed in by the application to judge whether they correspond to valid table data;
3) Return the result: the data management module matches the parameters passed in and returns the matching result to the application.
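
The layering described in the parsing flow above can be sketched as four small Python classes. This is a hedged illustration: the class and method names are hypothetical, the OFD reader is injected as a plain function (read_objects) because real OFD parsing is outside the scope of the sketch, and the parsing module is a placeholder that groups everything into one table instead of running the border lookup algorithm.

```python
from typing import Callable, Dict, List, Tuple

RawObject = Dict[str, object]  # assumed shape: {"kind": "segment" | "text" | ...}


class DataExtractionModule:
    """Reads a document and splits its objects into table and non-table data."""

    def __init__(self, read_objects: Callable[[str], List[RawObject]]):
        self._read_objects = read_objects  # hypothetical OFD reader

    def extract(self, path: str) -> Tuple[List[RawObject], List[RawObject]]:
        objects = self._read_objects(path)
        table_data = [o for o in objects if o["kind"] == "segment"]
        other_data = [o for o in objects if o["kind"] != "segment"]
        return table_data, other_data


class DataParsingModule:
    """Placeholder parser: a real one would run the border lookup algorithm."""

    def parse(self, table_data: List[RawObject],
              other_data: List[RawObject]) -> List[Dict[str, List[RawObject]]]:
        if not table_data:
            return []
        return [{"segments": table_data, "contents": other_data}]


class DataManagementModule:
    """Creates and schedules the two worker modules and stores their results."""

    def __init__(self, read_objects: Callable[[str], List[RawObject]]):
        self._extractor = DataExtractionModule(read_objects)
        self._parser = DataParsingModule()
        self.table_list: List[Dict[str, List[RawObject]]] = []

    def load(self, path: str) -> int:
        table_data, other_data = self._extractor.extract(path)
        self.table_list = self._parser.parse(table_data, other_data)
        return len(self.table_list)


class ApplicationInterfaceModule:
    """The only layer an application touches; it delegates to the manager."""

    def __init__(self, read_objects: Callable[[str], List[RawObject]]):
        self._manager = DataManagementModule(read_objects)

    def open_document(self, path: str) -> int:
        return self._manager.load(path)

    def tables(self) -> List[Dict[str, List[RawObject]]]:
        return list(self._manager.table_list)
```

Because the application only holds an ApplicationInterfaceModule, the extraction and parsing logic can change without affecting callers, which is the separation of application layer and logic layer that the method relies on.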
The present invention separates the application layer from the logic layer so that the two parts do not affect each other, which solves the problem that tables cannot be extracted from OFD documents because of document format and application environment factors, and realizes table extraction from OFD documents. The architecture is clear, easy to understand, and easy to implement; it simplifies debugging and later maintenance and improves the extensibility of the extraction technique. The method relies entirely on the standard format of OFD documents and needs almost no supporting software beyond the method itself, which reduces cost. Tables can be extracted losslessly, and related information such as the text in a table can be extracted and edited. Extracting tables by screenshot cannot achieve these effects.
Brief description of the drawings
Fig. 1 is the flowchart of the data parsing flow in the present invention;
Fig. 2 is the flowchart of the data application flow in the present invention;
Fig. 3 is the flowchart of the data parsing module in the present invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings. Referring to Fig. 1 to Fig. 3, a table extraction method based on OFD documents comprises a data parsing flow and a data application flow. The data parsing flow (shown in Fig. 1) comprises the following steps:
Application 101: the application that calls the method of the invention; its form is not limited and may be an executable program or a dynamic library.
Start application interface module 102: the application creates the application interface module object so that it can call the interface functions the application interface module provides.
Start data management module 103: the application interface module creates the data management module object. The data management module manages the data extraction module and the data parsing module and saves the data results they provide, so that the results can be passed to the application through the application interface module.
Create raw data list and table data list 104: the data management module creates the raw data list and the table data list. The raw data list stores the data of all suspected tables that the data extraction module extracts from the OFD document; the table data list stores the data of all tables obtained after the data parsing module parses the raw data list.
Start data extraction module 105: the data management module creates the data extraction module object. The data extraction module extracts the data content of suspected tables from the OFD document and stores it in the raw data list.
Start data parsing module 106: the data management module creates the data parsing module object. The data parsing module parses the raw data list, obtains all table data content from it, and stores that data in the table data list.
Parse OFD document 107: the data extraction module parses the specified OFD document and extracts the data content of all suspected tables.
Add data to raw data list 108: the data extraction module stores the data content of all suspected tables in the raw data list in list form.
Parse raw data list 109: the data parsing module parses the raw data list and obtains all table data content from it.
Add data to table data list 110: the data parsing module stores all table data content in the table data list in list form.
End 111: the data parsing flow ends.
Data application flow (shown in Fig. 2): the calling application 201 is not limited in form and may be an executable program or a dynamic library.
Application interface module 202: application 201 calls the corresponding interface function provided by the application interface module to implement the corresponding function.
Page number and coordinates 203: application 201 provides application interface module 202 with the OFD document page number of the table to be extracted and the coordinate values relative to that page.
Data management module 204: the application interface module passes the page number and coordinates to the data management module.
Table data list 205: the table data list stores the data of all tables in the current OFD document.
Match 206: the data management module matches the page number and coordinates against the table data list to confirm whether they belong to table data. If the match succeeds, the result is table 208; otherwise it is non-table 207.
Return result to application 209: the data management module returns the matching result to the application. If the result is a table, the data of table 208 is returned; if the result is non-table 207, a failure flag is returned.
End 210: the data application flow ends.
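
The matching step of this flow (206 to 209) can be sketched as a simple lookup. The sketch below is illustrative only: TableRecord, its fields, and match_table are hypothetical names, each table is assumed to be stored with its page number and bounding box, and Python's None stands in for the failure flag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TableRecord:
    """One entry of the table data list, with its page and bounding box."""
    page: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    name: str = ""  # illustrative label only


def match_table(table_list: List[TableRecord], page: int,
                x: float, y: float) -> Optional[TableRecord]:
    """Return the table containing (x, y) on the given page, else None."""
    for table in table_list:
        on_page = table.page == page
        inside = (table.x_min <= x <= table.x_max and
                  table.y_min <= y <= table.y_max)
        if on_page and inside:
            return table  # match succeeded: table data is returned
    return None  # match failed: the caller treats this as the failure flag
```

A caller would pass the page number and page-relative coordinates supplied by the application and check the result for None before using the table data.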
The flow of the data parsing module in the present invention (shown in Fig. 3) is as follows:
Start data extraction module 301: the data management module creates the data extraction module object.
Traverse table data objects 302: access each table data object in the table data object list one by one.
Traversal finished 303: all table data objects have been accessed.
Obtain baseline 304: the purpose of traversing the table data objects is to obtain the baseline, which is the basis for determining a table object.
Temporary table object 305: the temporary storage area for a table object.
Create temporary table object 306: if no temporary table object has been created, create one.
Save baseline to temporary table object 307: save the baseline data in the temporary table object.
Obtain table data objects intersecting the baseline 308: from the table data object list, obtain the table data objects that intersect the baseline.
Save table data objects to temporary table object 309: save all table data objects that intersect the baseline in the temporary table object.
Obtain vertex coordinates of temporary table object 310: take the four vertex coordinates of the table from the temporary table object.
Create table object list 311: the storage area for table objects.
Traverse non-table data objects 312: access the data objects in the data object list one by one and select the non-table data objects.
Within temporary table object range 313: analyze each accessed data object and judge whether it lies within the region of the temporary table object.
Save non-table data object to temporary table object 314: if the non-table data lies within the range of the table, store it in the temporary table object.
Traversal finished 315: end the access to the data object list.
Save temporary table object to table object list 316: save the content of the temporary table object in the table object list.
Hand table object list to data management module 317: send the content of the table object list to the data management module.
End 318: the function of the data extraction module is complete.

Claims (9)

1. A table extraction method based on OFD documents, comprising an application interface module, a data management module, a data extraction module, and a data parsing module, characterized in that:
The application interface module provides simple, direct interface functions for upper-layer applications to call, and is responsible for calling the data management module to implement those interface functions;
The data management module schedules the data extraction module and the data parsing module, aggregates the data content they produce, and delivers the data to the application interface module for use;
The data extraction module parses the OFD document, extracts all data from it, classifies the data into table data objects and non-table data objects, and hands the classified data to the data management module for unified distribution and management;
The data parsing module obtains the table data objects from the data management module, processes all of them with a table border lookup algorithm to obtain table objects, assembles the table objects into a table list, and hands the table list to the data management module.
2. The table extraction method based on OFD documents according to claim 1, characterized in that the table data is the basic element that makes up a table and can be understood as a line segment; it contains the X and Y coordinates of the segment's start point, the X and Y coordinates of its end point, and whether the segment is a dashed line.
3. The table extraction method based on OFD documents according to claim 1, characterized in that the table data object is a set of table data.
4. The table extraction method based on OFD documents according to claim 1, characterized in that the non-table data object is non-segment data content such as text, pictures, and annotations.
5. The table extraction method based on OFD documents according to claim 1, characterized in that the table object is a complete table composed of table object data.
6. The table extraction method based on OFD documents according to claim 1, characterized in that the table list is a list made up of table objects.
7. The table extraction method based on OFD documents according to claim 1, characterized in that the detailed procedure of the data parsing module is as follows:
1) Traverse all table data objects and take the one whose vertex has the minimum X and Y coordinates; if two or more identical data objects exist, take the first one found; define this table data object as the baseline;
2) Store the baseline in a temporary table object;
3) Traverse all table data objects, take every table data object whose vertex coordinates intersect the baseline, and store the objects taken in the temporary table object from step 2);
4) Define the first table data object found in step 3) as the new baseline and repeat step 3) until all table data objects have been traversed;
5) Take the four vertex coordinates of the temporary table object;
6) Traverse all non-table data objects, take every one whose coordinates lie within the range of the four vertex coordinates from step 5), and save it in the temporary table object;
7) Create a table object list and store the temporary table object in it;
8) Hand the table object list to the data management module to be saved.
9. The table extraction method based on OFD documents according to claim 1, characterized in that it further comprises a data parsing flow and a data application flow:
The data parsing flow is as follows:
1) Start the application interface module: create the application interface module object; the application uses the external interface functions that the application interface module provides;
2) Start the data management module: create the data management module object; the data management module starts working;
3) Start the data extraction module: create the data extraction module object; the data management module assigns its work;
4) Start the data parsing module: create the data parsing module object; the data management module assigns its work;
The data application flow is as follows:
1) Call the application interface module: the application calls a function of the application interface module and passes in parameters to obtain table data;
2) Call the data management module: the application interface module calls the data management module, which uses the parameters passed in by the application to judge whether they correspond to valid table data;
3) Return the result: the data management module matches the parameters passed in and returns the matching result to the application.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811343405.XA CN109492211A (en) 2018-11-13 2018-11-13 Table extraction method based on OFD documents

Publications (1)

Publication Number Publication Date
CN109492211A true CN109492211A (en) 2019-03-19

Family

ID=65694795

Country Status (1)

Country Link
CN (1) CN109492211A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1858786A (en) * 2006-06-09 2006-11-08 宋丽娟 Electronic file formatting annotate and comment system and method
CN101206568A (en) * 2007-12-07 2008-06-25 华中科技大学 Gridding application program interface system based on Web
CN103399857A (en) * 2013-07-01 2013-11-20 北京航空航天大学 General method for extracting document structural information
CN104346322A (en) * 2013-08-08 2015-02-11 北大方正集团有限公司 Document format processing device and document format processing method
CN105988979A (en) * 2015-02-16 2016-10-05 北京邮电大学 Form extraction method and device based on PDF (Portable Document Format) file
CN106897690A (en) * 2017-02-22 2017-06-27 南京述酷信息技术有限公司 PDF table extracting methods
CN107622230A (en) * 2017-08-30 2018-01-23 中国科学院软件研究所 A kind of PDF list data analytic methods based on region recognition with segmentation
CN108564990A (en) * 2018-04-11 2018-09-21 泰山医学院 Doctor, which supports, combines data pick-up synchronization system and method, information data processing terminal

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119502A (en) * 2019-05-13 2019-08-13 江西金格科技股份有限公司 A method of dynamic table single domain is realized based on OFD document
CN111898433A (en) * 2020-06-22 2020-11-06 百望股份有限公司 Paper bill digitization method and device
CN111898433B (en) * 2020-06-22 2024-04-09 百望股份有限公司 Paper bill digitizing method and device
CN116384356A (en) * 2023-06-02 2023-07-04 福昕鲲鹏(北京)信息科技有限公司 Method, device, equipment and medium for creating form row of OFD file
CN116384356B (en) * 2023-06-02 2023-08-22 福昕鲲鹏(北京)信息科技有限公司 Method, device, equipment and medium for creating form row of OFD file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20190319)