CN109492211A - Table extraction method based on OFD documents - Google Patents
Table extraction method based on OFD documents
- Publication number
- CN109492211A (application CN201811343405.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- list
- module
- management module
- objects
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
Abstract
The invention discloses a table extraction method based on OFD documents, comprising a data parsing functional flow and a data application functional flow. The application layer is separated from the logic layer so that the two do not affect each other, which solves the problem that tables cannot be extracted from an OFD document because of the document format and the application environment, and realizes table extraction from OFD documents. The architecture is clear, easy to understand, and easy to implement; it simplifies debugging and later maintenance and improves the extensibility of the extraction technique. The method relies entirely on the standard format of OFD documents and needs almost no software support beyond the method itself, which reduces cost. Tables can be extracted losslessly, and related information such as the text in a table can be extracted and edited.
Description
Technical field
The present invention relates to techniques for processing electronic document formats, and in particular to a table extraction method based on OFD documents.
Background art
A fixed-layout document format presents digital content objects such as text, graphics, and images in a layout that is fixed according to certain rules.
OFD (Open Fixed-layout Document) is a fixed-layout document format developed independently in China; it is independent of software, hardware, operating system, and output device.
As national demand for and adoption of OFD documents grow, OFD documents are used more and more frequently across industries. Besides reading document content, current OFD documents support functions such as annotating the document, editing bookmarks, applying electronic seals, and editing attachments; written approvals and sealing of OFD documents are especially frequent in government and public institutions.
Because of the definition and format constraints of OFD, which is designed for reading and printing documents, its interaction with the reader is relatively weak; this includes the extraction of content such as images, tables, and text. Current readers can extract only the text of an OFD document with relative ease; extracting images or tables is inefficient, inaccurate, or impossible. Tables can currently be extracted by screen capture, but because screen capture depends on the DPI of the hardware device, the picture is easily distorted, and the captured table content can be neither selected nor edited.
Summary of the invention
To solve the above technical problem, the present invention provides a table extraction method based on OFD documents. Centered on the OFD document standard, it extracts tables from an OFD document by parsing the document content, which strengthens the interaction between OFD documents and other applications, promotes the adoption of OFD documents, and improves working efficiency.
The above object is achieved by the following technical solution. A table extraction method based on OFD documents comprises an application interface module, a data management module, a data extraction module, and a data parsing module, characterized in that:
The application interface module provides simple, clear interface functions for upper-layer applications to call, and is responsible for calling the data management module to implement those interface functions;
The data management module schedules the data extraction module and the data parsing module, collects the data produced by them, and delivers the data to the application interface module;
The data extraction module parses the OFD document, extracts all data from it, classifies the data into table data objects and non-table data objects, and hands the classified data to the data management module for unified allocation and management;
The data parsing module obtains the table data objects from the data management module, classifies all table data objects with a table boundary search algorithm to obtain table objects, assembles the table objects into a table list, and hands the table list to the data management module.
Further, the table data is the basic element that makes up a table and can be understood as a line segment; it contains the X and Y coordinates of the segment's start point, the X and Y coordinates of its end point, and whether the segment is dashed.
Further, the table data object is a set of table data.
Further, the non-table data object is non-segment content such as text, pictures, and annotations.
Further, the table object is a complete table made up of table object data.
Further, the table list is a list made up of table objects.
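The data items defined above map naturally onto a few small types. The following is a minimal sketch, not the patent's implementation: the class and field names (`TableData`, `NonTableData`, `TableObject`, `dashed`, and so on) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class TableData:
    """One line segment, the basic element of a table (hypothetical name)."""
    x1: float  # start point X
    y1: float  # start point Y
    x2: float  # end point X
    y2: float  # end point Y
    dashed: bool = False  # whether the segment is a dashed line


# A table data object is a set of table data (line segments).
TableDataObject = List[TableData]


@dataclass
class NonTableData:
    """Non-segment content: text, picture, or annotation."""
    kind: str  # "text", "picture", or "annotation"
    x: float
    y: float
    content: str = ""


@dataclass
class TableObject:
    """A complete table: its segments plus the content found inside it."""
    data_objects: List[TableDataObject] = field(default_factory=list)
    contents: List[NonTableData] = field(default_factory=list)


# One rectangular cell with a single piece of text inside it.
cell_border: TableDataObject = [
    TableData(0, 0, 10, 0), TableData(0, 5, 10, 5),
    TableData(0, 0, 0, 5), TableData(10, 0, 10, 5),
]
table = TableObject(
    data_objects=[cell_border],
    contents=[NonTableData("text", 2, 2, "Total")],
)

# The table list is a plain list of table objects.
table_list: List[TableObject] = [table]
```

Keeping the segment type immutable (`frozen=True`) is a design choice of this sketch; it makes segments safe to share between lists.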
Further, the detailed flow of the data parsing module is as follows:
1) Traverse all table data objects and take the one whose vertex has the smallest X and Y coordinates; if two or more identical data objects exist, take the first one found. Define this table data object as the baseline;
2) Store the baseline in a temporary table object;
3) Traverse all table data objects, take every one whose vertex coordinates intersect the baseline, and store them in the temporary table object from step 2);
4) Define the first table data object found in step 3) as the new baseline and repeat step 3) until all table data objects have been traversed;
5) Take the four vertex coordinates of the temporary table object;
6) Traverse all non-table data objects, take every one whose coordinates lie within the range of the four vertex coordinates from step 5), and store them in the temporary table object;
7) Create a table object list and store the temporary table object in it;
8) Hand the table object list to the data management module for storage.
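The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented algorithm itself: each table data object is simplified to one axis-aligned segment `(x1, y1, x2, y2)`, "smallest X and Y" is read as lexicographic order on the segment's minimum corner, and a queue is used so that every newly found segment serves as a baseline in turn (a generalization of step 4). The names `touches`, `group_table`, `bounding_box`, and `contents_inside` are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned
Item = Tuple[float, float, str]              # (x, y, text) non-table content
Box = Tuple[float, float, float, float]      # (min_x, min_y, max_x, max_y)


def touches(a: Segment, b: Segment, tol: float = 0.01) -> bool:
    """True if two axis-aligned segments cross or share a point."""
    ax1, ax2 = sorted((a[0], a[2]))
    ay1, ay2 = sorted((a[1], a[3]))
    bx1, bx2 = sorted((b[0], b[2]))
    by1, by2 = sorted((b[1], b[3]))
    return (ax1 <= bx2 + tol and bx1 <= ax2 + tol
            and ay1 <= by2 + tol and by1 <= ay2 + tol)


def group_table(segments: List[Segment]) -> List[Segment]:
    """Steps 1-4: collect every segment connected to the first baseline."""
    if not segments:
        return []
    # Step 1: first segment with the smallest corner becomes the baseline.
    start = min(range(len(segments)),
                key=lambda i: (min(segments[i][0], segments[i][2]),
                               min(segments[i][1], segments[i][3])))
    baseline = segments[start]
    remaining = segments[:start] + segments[start + 1:]
    table = [baseline]                     # Step 2: temporary table object
    queue = deque([baseline])
    while queue:                           # Steps 3-4: grow from each baseline
        current = queue.popleft()
        found = [s for s in remaining if touches(current, s)]
        remaining = [s for s in remaining if not touches(current, s)]
        table.extend(found)
        queue.extend(found)
    return table


def bounding_box(table: List[Segment]) -> Box:
    """Step 5: the four vertices, expressed as min/max corners."""
    xs = [x for s in table for x in (s[0], s[2])]
    ys = [y for s in table for y in (s[1], s[3])]
    return min(xs), min(ys), max(xs), max(ys)


def contents_inside(items: List[Item], box: Box) -> List[Item]:
    """Step 6: keep non-table items whose position lies inside the box."""
    x0, y0, x1, y1 = box
    return [it for it in items if x0 <= it[0] <= x1 and y0 <= it[1] <= y1]


# A two-cell table plus one stray line elsewhere on the page.
segments = [(0, 0, 20, 0), (0, 10, 20, 10), (0, 0, 0, 10),
            (10, 0, 10, 10), (20, 0, 20, 10), (50, 50, 60, 50)]
table = group_table(segments)
box = bounding_box(table)
cells = contents_inside([(5, 5, "A"), (55, 50, "B")], box)
```

In this example the stray line at `(50, 50, 60, 50)` is never reached from the baseline, so it stays out of the table, and only the item at `(5, 5)` falls inside the table's bounding box.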
The table extraction method based on OFD documents further comprises a data parsing functional flow and a data application functional flow.
The data parsing functional flow is as follows:
1) Start the application interface module: create an application interface module object; the application uses the external interface functions of the application interface module;
2) Start the data management module: create a data management module object; the data management module starts working;
3) Start the data extraction module: create a data extraction module object; the data management module assigns its work;
4) Start the data parsing module: create a data parsing module object; the data management module assigns its work.
The data application functional flow is as follows:
1) Call the application interface module: the application calls an interface function of the application interface module and passes in parameters to obtain table data;
2) Call the data management module: the application interface module calls the data management module, which uses the parameters passed in by the application to judge whether they correspond to valid table data;
3) Return the result: the data management module matches the passed-in parameters and returns the matching result to the application.
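The separation of the application layer from the logic layer described in these flows could look something like the sketch below. All class and method names (`ApplicationInterface`, `DataManager`, `open_document`, and so on) are hypothetical, and the extractor returns canned data because real OFD parsing is outside the scope of this illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float, float, float]


class DataExtractor:
    """Logic layer: would parse an OFD file; here it returns canned data."""

    def extract(self, path: str) -> Tuple[List[Segment], List[str]]:
        # Placeholder result, not real OFD parsing.
        table_data = [(0, 0, 10, 0), (0, 0, 0, 5)]
        non_table_data = ["caption"]
        return table_data, non_table_data


class DataParser:
    """Logic layer: turns table data into a list of table objects."""

    def parse(self, table_data: List[Segment]) -> List[List[Segment]]:
        # Placeholder: treat all segments as belonging to one table.
        return [table_data] if table_data else []


class DataManager:
    """Schedules the extractor and the parser and stores their results."""

    def __init__(self, extractor: DataExtractor, parser: DataParser) -> None:
        self._extractor = extractor
        self._parser = parser
        self.tables: List[List[Segment]] = []

    def load(self, path: str) -> None:
        table_data, _non_table_data = self._extractor.extract(path)
        self.tables = self._parser.parse(table_data)


class ApplicationInterface:
    """Application layer: the only object the calling program touches."""

    def __init__(self) -> None:
        self._manager = DataManager(DataExtractor(), DataParser())

    def open_document(self, path: str) -> int:
        """Run extraction and parsing; return the number of tables found."""
        self._manager.load(path)
        return len(self._manager.tables)

    def get_table(self, index: int) -> List[Segment]:
        return self._manager.tables[index]


api = ApplicationInterface()
table_count = api.open_document("example.ofd")  # hypothetical file name
```

Because the calling program sees only `ApplicationInterface`, the extractor or parser can be replaced without changing application code, which is the independence between layers that the text describes.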
The present invention separates the application layer from the logic layer so that the two do not affect each other, solving the problem that tables cannot be extracted from OFD documents because of the document format and the application environment, and realizing table extraction from OFD documents. The architecture is clear, easy to understand, and easy to implement; it simplifies debugging and later maintenance and improves the extensibility of the extraction technique. The method relies entirely on the standard format of OFD documents and needs almost no software support beyond the method itself, which reduces cost. Tables can be extracted losslessly, and related information such as the text in a table can be extracted and edited. Extracting tables by screen capture cannot achieve these effects.
Brief description of the drawings
Fig. 1 is a flowchart of the data parsing functional flow of the present invention;
Fig. 2 is a flowchart of the data application functional flow of the present invention;
Fig. 3 is a flowchart of the data parsing module of the present invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings. Referring to Figs. 1 to 3, a table extraction method based on OFD documents comprises a data parsing functional flow and a data application functional flow. The data parsing functional flow (shown in Fig. 1) comprises the following steps:
Application 101: the application that calls the method of the invention; its form is not limited, and it may be an executable program or a dynamic library.
Start application interface module 102: the application creates the application interface module object so that it can call the interface functions provided by the application interface module.
Start data management module 103: the application interface module creates the data management module object. The data management module manages the data extraction module and the data parsing module and stores the data results they provide, so that the results can be passed to the application through the application interface module.
Create raw data list and table data list 104: the data management module creates a raw data list and a table data list. The raw data list stores the data of all suspected tables extracted from the OFD document by the data extraction module; the table data list stores the data of all tables obtained after the data parsing module parses the raw data list.
Start data extraction module 105: the data management module creates the data extraction module object. The data extraction module extracts the data content of suspected tables from the OFD document and stores it in the raw data list.
Start data parsing module 106: the data management module creates the data parsing module object. The data parsing module parses the raw data list, obtains all table data content from it, and stores that data in the table data list.
Parse OFD document 107: the data extraction module parses the specified OFD document and extracts the data content of all suspected tables.
Add data to raw data list 108: the data extraction module stores the data content of all suspected tables in the raw data list as a list.
Parse raw data list 109: the data parsing module parses the raw data list and obtains all table data content from it.
Add data to table data list 110: the data parsing module stores all table data content in the table data list as a list.
End 111: the data parsing flow ends.
The data application functional flow (shown in Fig. 2) comprises the following steps:
Application 201: its form is not limited, and it may be an executable program or a dynamic library.
Application interface module 202: the application 201 calls the corresponding interface function provided by the application interface module to perform the corresponding function.
Page number and coordinates 203: the application 201 provides the application interface module 202 with the page number of the OFD document page containing the table to be extracted and the coordinate values relative to that page.
Data management module 204: the application interface module passes the page number and coordinates to the data management module.
Table data list 205: the table data list stores the data of all tables in the current OFD document.
Matching 206: the data management module matches the page number and coordinates against the table data list to confirm whether they belong to table data. If the match succeeds, the result is table 208; otherwise it is non-table 207.
Return result to application 209: the data management module returns the matching result to the application; if the result is a table, the data of table 208 is returned; if the result is non-table 207, a failure flag is returned.
End 210: the data application flow ends.
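The matching step in this flow amounts to a lookup by page number and point. The sketch below assumes, for illustration only, that each entry in the table data list records its page and an axis-aligned bounding rectangle; the `Table` type, its fields, and `match_table` are hypothetical names, and `None` stands in for the failure flag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Table:
    """One entry of the table data list (fields are assumptions)."""
    page: int
    left: float
    top: float
    right: float
    bottom: float


def match_table(tables: List[Table], page: int,
                x: float, y: float) -> Optional[Table]:
    """Return the table on `page` that contains (x, y), or None if no match."""
    for table in tables:
        if (table.page == page
                and table.left <= x <= table.right
                and table.top <= y <= table.bottom):
            return table
    return None  # plays the role of the failure flag


table_data_list = [Table(page=1, left=0, top=0, right=20, bottom=10)]
hit = match_table(table_data_list, page=1, x=5, y=5)
miss = match_table(table_data_list, page=2, x=5, y=5)
```

Here `hit` is the table on page 1, while `miss` is `None` because no table was recorded for page 2.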
The flow of the data parsing module of the present invention (shown in Fig. 3) is as follows:
Start data extraction module 301: the data management module creates the data extraction module object.
Traverse table data objects 302: access each table data object in the table data object list one by one.
Traversal ends 303: all table data objects have been accessed.
Obtain baseline 304: the purpose of traversing the table data objects is to obtain the baseline, which is the basis for determining a table object.
Temporary table object 305: the temporary storage area for a table object.
Create temporary table object 306: create the temporary table object if it has not been created.
Save baseline to temporary table object 307: save the baseline data in the temporary table object.
Obtain table data objects intersecting the baseline 308: from the table data object list, obtain the table data objects that intersect the baseline.
Save table data objects to temporary table object 309: save all table data objects that intersect the baseline in the temporary table object.
Obtain vertex coordinates of temporary table object 310: take the four vertex coordinates of the table from the temporary table object.
Create table object list 311: the storage area for table objects.
Traverse non-table data objects 312: access the data objects in the data object list one by one and select the non-table data objects.
Within temporary table object range 313: analyze each accessed data object to judge whether it lies within the region of the temporary table object.
Save non-table data objects to temporary table object 314: if a non-table data object lies within the table's range, store it in the temporary table object.
Traversal ends 315: the access to the data object list ends.
Save temporary table object to table object list 316: save the content of the temporary table object in the table object list.
Hand table object list to data management module 317: send the content of the table object list to the data management module.
End 318: the function of the data extraction module is complete.
Claims (9)
1. A table extraction method based on OFD documents, comprising an application interface module, a data management module, a data extraction module, and a data parsing module, characterized in that:
The application interface module provides simple, clear interface functions for upper-layer applications to call, and is responsible for calling the data management module to implement those interface functions;
The data management module schedules the data extraction module and the data parsing module, collects the data produced by them, and delivers the data to the application interface module;
The data extraction module parses the OFD document, extracts all data from it, classifies the data into table data objects and non-table data objects, and hands the classified data to the data management module for unified allocation and management;
The data parsing module obtains the table data objects from the data management module, classifies all table data objects with a table boundary search algorithm to obtain table objects, assembles the table objects into a table list, and hands the table list to the data management module.
2. The table extraction method based on OFD documents according to claim 1, characterized in that the table data is the basic element that makes up a table and can be understood as a line segment; it contains the X and Y coordinates of the segment's start point, the X and Y coordinates of its end point, and whether the segment is dashed.
3. The table extraction method based on OFD documents according to claim 1, characterized in that the table data object is a set of table data.
4. The table extraction method based on OFD documents according to claim 1, characterized in that the non-table data object is non-segment content such as text, pictures, and annotations.
5. The table extraction method based on OFD documents according to claim 1, characterized in that the table object is a complete table made up of table object data.
6. The table extraction method based on OFD documents according to claim 1, characterized in that the table list is a list made up of table objects.
7. The table extraction method based on OFD documents according to claim 1, characterized in that the detailed flow of the data parsing module is as follows:
1) Traverse all table data objects and take the one whose vertex has the smallest X and Y coordinates; if two or more identical data objects exist, take the first one found. Define this table data object as the baseline;
2) Store the baseline in a temporary table object;
3) Traverse all table data objects, take every one whose vertex coordinates intersect the baseline, and store them in the temporary table object from step 2);
4) Define the first table data object found in step 3) as the new baseline and repeat step 3) until all table data objects have been traversed;
5) Take the four vertex coordinates of the temporary table object;
6) Traverse all non-table data objects, take every one whose coordinates lie within the range of the four vertex coordinates from step 5), and store them in the temporary table object;
7) Create a table object list and store the temporary table object in it;
8) Hand the table object list to the data management module for storage.
9. The table extraction method based on OFD documents according to claim 1, characterized in that it further comprises a data parsing functional flow and a data application functional flow.
The data parsing functional flow is as follows:
1) Start the application interface module: create an application interface module object; the application uses the external interface functions of the application interface module;
2) Start the data management module: create a data management module object; the data management module starts working;
3) Start the data extraction module: create a data extraction module object; the data management module assigns its work;
4) Start the data parsing module: create a data parsing module object; the data management module assigns its work.
The data application functional flow is as follows:
1) Call the application interface module: the application calls an interface function of the application interface module and passes in parameters to obtain table data;
2) Call the data management module: the application interface module calls the data management module, which uses the parameters passed in by the application to judge whether they correspond to valid table data;
3) Return the result: the data management module matches the passed-in parameters and returns the matching result to the application.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811343405.XA CN109492211A (en) | 2018-11-13 | 2018-11-13 | A kind of table extracting method based on OFD document |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811343405.XA CN109492211A (en) | 2018-11-13 | 2018-11-13 | A kind of table extracting method based on OFD document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109492211A true CN109492211A (en) | 2019-03-19 |
Family
ID=65694795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811343405.XA Pending CN109492211A (en) | 2018-11-13 | 2018-11-13 | A kind of table extracting method based on OFD document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109492211A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110119502A (en) * | 2019-05-13 | 2019-08-13 | 江西金格科技股份有限公司 | A method of dynamic table single domain is realized based on OFD document |
CN111898433A (en) * | 2020-06-22 | 2020-11-06 | 百望股份有限公司 | Paper bill digitization method and device |
CN116384356A (en) * | 2023-06-02 | 2023-07-04 | 福昕鲲鹏(北京)信息科技有限公司 | Method, device, equipment and medium for creating form row of OFD file |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1858786A (en) * | 2006-06-09 | 2006-11-08 | 宋丽娟 | Electronic file formatting annotate and comment system and method |
CN101206568A (en) * | 2007-12-07 | 2008-06-25 | 华中科技大学 | Gridding application program interface system based on Web |
CN103399857A (en) * | 2013-07-01 | 2013-11-20 | 北京航空航天大学 | General method for extracting document structural information |
CN104346322A (en) * | 2013-08-08 | 2015-02-11 | 北大方正集团有限公司 | Document format processing device and document format processing method |
CN105988979A (en) * | 2015-02-16 | 2016-10-05 | 北京邮电大学 | Form extraction method and device based on PDF (Portable Document Format) file |
CN106897690A (en) * | 2017-02-22 | 2017-06-27 | 南京述酷信息技术有限公司 | PDF table extracting methods |
CN107622230A (en) * | 2017-08-30 | 2018-01-23 | 中国科学院软件研究所 | A kind of PDF list data analytic methods based on region recognition with segmentation |
CN108564990A (en) * | 2018-04-11 | 2018-09-21 | 泰山医学院 | Doctor, which supports, combines data pick-up synchronization system and method, information data processing terminal |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110119502A (en) * | 2019-05-13 | 2019-08-13 | 江西金格科技股份有限公司 | A method of dynamic table single domain is realized based on OFD document |
CN111898433A (en) * | 2020-06-22 | 2020-11-06 | 百望股份有限公司 | Paper bill digitization method and device |
CN111898433B (en) * | 2020-06-22 | 2024-04-09 | 百望股份有限公司 | Paper bill digitizing method and device |
CN116384356A (en) * | 2023-06-02 | 2023-07-04 | 福昕鲲鹏(北京)信息科技有限公司 | Method, device, equipment and medium for creating form row of OFD file |
CN116384356B (en) * | 2023-06-02 | 2023-08-22 | 福昕鲲鹏(北京)信息科技有限公司 | Method, device, equipment and medium for creating form row of OFD file |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN101025738B (en) | Template-free dynamic website generating method | |
CN105930159B (en) | A kind of method and system that the GUI code based on image generates | |
CN103186510B (en) | A kind of method and apparatus of convert documents form | |
US8892990B2 (en) | Automatic creation of a table and query tools | |
CN109492211A (en) | A kind of table extracting method based on OFD document | |
US20170371844A1 (en) | Method, device and terminal for implementing regional screen capture | |
CN109829139B (en) | Method and device for converting DOC/DOCX format streaming file into OFD format file | |
CN109492199B (en) | PDF file conversion method based on OCR pre-judgment | |
US20130191732A1 (en) | Fixed Format Document Conversion Engine | |
US20130326341A1 (en) | Digital comic editor, method and non-transitorycomputer-readable medium | |
WO2019041442A1 (en) | Method and system for structural extraction of figure data, electronic device, and computer readable storage medium | |
CN107633055B (en) | Method for converting picture into HTML document | |
WO2020233023A1 (en) | Psd file editing method implemented based on layering technology, and electronic device | |
CN105975446A (en) | Method and system for displaying word document content by modules in mobile phone terminal | |
CN108389244B (en) | Implementation method for rendering flash rich text according to specified character rules | |
CN106776994B (en) | Application method and system of engineering symbols in engineering report forms and web pages | |
CN110377371B (en) | Style sheet system management method based on Web tag | |
CN110310226B (en) | Picture mosaic display method and system | |
CN111190519A (en) | File and control processing method, device, equipment and storage medium thereof | |
CN109271616A (en) | A kind of intelligent extract method based on normative document questions record characteristic value | |
CN107066438A (en) | A kind of method for editing text and device, electronic equipment | |
CN111859886B (en) | Document generation method and device based on product prototype interface | |
CN111274156B (en) | Automatic identification method and device compatible with multi-frame pages | |
CN109635729A (en) | A kind of Table recognition method and terminal | |
CN115268904A (en) | User interface design file generation method, device, equipment and medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190319 |