CN102855243A - Method and device for extracting document structure - Google Patents
Method and device for extracting document structure
- Publication number
- CN102855243A (application number CN2011101799728A)
- Authority
- CN
- China
- Prior art keywords
- document
- standard format
- content
- rule
- identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention provides a method for extracting a document structure. The method comprises the following steps: acquiring an object of a document; converting the object into a predefined standard format; identifying and marking items in the object in the standard format; and extracting contents of the matched items to form structural data relevant to the document. The invention also provides a device for extracting the document structure. The device comprises an acquisition module for acquiring the object of the document, a conversion module for converting the object into the predefined standard format, a marking module for identifying and marking the items in the object in the standard format, and an extraction module for extracting contents of the matched items to form structural data relevant to the document. By the method and the device, an effect of improving the efficiency for document structure extraction is achieved.
Description
Technical field
The present invention relates to the field of digital publishing, and in particular to a method and device for extracting document structure.
Background art
In traditional publishing, the document formats of books, newspapers and periodicals serve only the needs of traditional printing. Their description of content is limited to visual elements such as the outline, color and position of text, graphics and images, and does not capture the logical content of a document or its internal relationships. Digital publishing pays more attention to the logical content of a document, the relationships within it, and the granularity of the content; structuring a document is a precondition for reusing digital content.
At present, document content is structured mainly by hand: following predefined rules, operators visually identify the qualifying content in a document and enter it manually into a custom form. This way of working is inefficient, labor-intensive and error-prone.
Another solution uses a computer to apply preset matching rules to identify the document structure. The inventors found that, because many common document formats exist, current solutions adopt a different processing method and system for each document format, which is cumbersome to operate.
Summary of the invention
The present invention aims to provide a method and device for extracting document structure, so as to solve the problem that the related art is cumbersome to operate.
An embodiment of the present invention provides a method for extracting document structure, comprising: acquiring objects of a document; converting the objects into a predefined standard format; identifying and marking items in the objects in the standard format; and extracting the content of each matched item to organize it into structured data about the document.
An embodiment of the present invention provides a device for extracting document structure, comprising: an acquisition module, configured to acquire objects of a document; a conversion module, configured to convert the objects into a predefined standard format; a marking module, configured to identify and mark items in the objects in the standard format; and an extraction module, configured to extract the content of each matched item to organize it into structured data about the document.
Because the method and device of the above embodiments unify the format of the objects in advance, data items can be identified automatically. This solves the problem that the related art is cumbersome to operate and improves the efficiency of extracting document structure.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows a flowchart of a method for extracting document structure according to an embodiment of the invention;
Fig. 2 shows a flowchart of a method for extracting document structure according to a preferred embodiment of the invention;
Fig. 3 shows a schematic diagram of a device for extracting document structure according to an embodiment of the invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows a flowchart of a method for extracting document structure according to an embodiment of the invention, comprising:
Step S10: acquire objects of a document;
Step S20: convert the objects into a predefined standard format;
Step S30: identify and mark items in the objects in the standard format;
Step S40: extract the content of each matched item to organize it into structured data about the document.
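The four steps can be read as a simple pipeline. The following is a minimal Python sketch under stated assumptions, not the patent's implementation: the `DocObject` type, the function names, the representation of a document as a list of `(kind, content)` pairs, and the single toy rule in `identify_and_mark` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocObject:
    """One object of a document in a hypothetical standard format."""
    kind: str                    # e.g. "text", "image", "table"
    content: str
    label: Optional[str] = None  # set by the marking step


def acquire(raw_document):
    # S10: in this sketch the document is already a list of (kind, content) pairs.
    return list(raw_document)


def convert(raw_objects):
    # S20: normalize every object into the same DocObject form.
    return [DocObject(kind.lower(), content.strip()) for kind, content in raw_objects]


def identify_and_mark(objects):
    # S30: toy rule (an assumption): text starting with "Chapter" is a title.
    for obj in objects:
        if obj.kind == "text" and obj.content.startswith("Chapter"):
            obj.label = "title"
    return objects


def extract(objects):
    # S40: keep only the marked items, grouped by label.
    result = {}
    for obj in objects:
        if obj.label is not None:
            result.setdefault(obj.label, []).append(obj.content)
    return result


def extract_structure(raw_document):
    return extract(identify_and_mark(convert(acquire(raw_document))))
```

Because S20 runs before S30, the identification rule only ever sees `DocObject` values, which is the point of unifying the format first.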
Commonly used electronic documents come in various formats such as PDF and WORD. Existing document structure recognition techniques cannot identify the objects in documents of different formats at the same time, so a different processing method and system must be used for each document format, which is cumbersome, labor-intensive and error-prone. In this embodiment, the format of the objects is unified in advance: on the basis of a defined, unified output format, document processing is standardized and passes through a series of stages, so that the same tool and system can structure documents of multiple formats. This increases processing speed, standardizes the output document format, and reduces human error.
Preferably, step S10 comprises: acquiring the objects embedded in the document and the objects externally linked by the document. Existing document formats are fairly complex: an object may be embedded in the document, or the document may contain only the link address of the object. By acquiring both the embedded objects and the externally linked objects, this preferred embodiment ensures that no object is omitted.
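As a rough illustration of acquiring both kinds of object, the sketch below assumes a document represented as a dictionary with hypothetical `embedded` and `links` keys, and a caller-supplied `fetch_linked` resolver that returns `None` when a link cannot be resolved. None of these names come from the patent.

```python
def collect_objects(document, fetch_linked):
    """Return the embedded objects plus the objects resolved from external links."""
    objects = list(document.get("embedded", []))
    for address in document.get("links", []):
        linked = fetch_linked(address)  # hypothetical resolver supplied by the caller
        if linked is not None:          # unresolved links are skipped in this sketch
            objects.append(linked)
    return objects


# Usage with an in-memory stand-in for external storage.
store = {"figures/chart.png": {"kind": "image", "content": "chart.png"}}
doc = {
    "embedded": [{"kind": "text", "content": "Hello"}],
    "links": ["figures/chart.png", "figures/missing.png"],
}
objects = collect_objects(doc, store.get)
```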
Preferably, step S30 comprises: matching the objects in the standard format against preset matching rules to identify the items that satisfy the matching rules; and marking each item accordingly under preset marking rules. Once matching rules and marking rules are defined, they can readily be executed by a computer program, which automates the operation.
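One way to picture the two rule sets is as a table of patterns and a table of labels. In this hedged Python sketch the rule names, the regular expressions and the label strings are illustrative assumptions; the patent does not specify what a rule looks like.

```python
import re

# Hypothetical matching rules: item name -> pattern tested against text content.
MATCH_RULES = {
    "page_number": re.compile(r"^\d+$"),
    "chapter_title": re.compile(r"^Chapter \d+"),
}

# Hypothetical marking rules: item name -> label to attach to a matched item.
MARK_RULES = {
    "page_number": "pageNumber",
    "chapter_title": "title",
}


def mark_objects(texts):
    """Return (text, label) pairs; label is None when no matching rule applies."""
    marked = []
    for text in texts:
        label = None
        for name, pattern in MATCH_RULES.items():
            if pattern.search(text):
                label = MARK_RULES[name]
                break  # first matching rule wins in this sketch
        marked.append((text, label))
    return marked
```

Keeping the matching rules separate from the marking rules means the label vocabulary can change without touching the patterns.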
Preferably, step S30 further comprises: providing an interface to accept modifications to the marks. Because document content is highly complex, the content that the computer identifies and marks automatically may be inaccurate. By providing a human-computer interaction interface, this preferred embodiment allows errors to be corrected and omissions to be supplemented manually.
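The interface itself is not described, but the data-side effect of a manual correction can be sketched simply. The function below assumes that marks are held as `(text, label)` pairs and that the operator's corrections arrive as a mapping from item position to new label; both representations are assumptions made for illustration.

```python
def apply_corrections(marked, corrections):
    """Return a new list of (text, label) pairs with manual label overrides applied.

    marked:      list of (text, label) pairs from automatic marking
    corrections: {index: new_label} collected from a hypothetical review interface
    """
    return [
        (text, corrections.get(index, label))
        for index, (text, label) in enumerate(marked)
    ]
```

Returning a new list rather than editing in place keeps the automatic result available for comparison with the corrected one.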
Preferably, marking is performed using tags and/or content controls. This is a commonly used way of marking and is easy to implement.
Preferably, the matching rules and marking rules are defined in XML format. XML is a standard structured language, which makes the matching rules and marking rules comparatively easy to define.
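The patent does not give a schema for the XML rules. As a sketch only, the example below invents a `<rules>` document whose `<rule>` elements carry hypothetical `pattern` and `label` attributes, and loads it with Python's standard `xml.etree.ElementTree` module.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file; the element and attribute names are assumptions.
RULES_XML = r"""
<rules>
  <rule name="page_number" pattern="^\d+$" label="pageNumber"/>
  <rule name="chapter_title" pattern="^Chapter \d+" label="title"/>
</rules>
"""


def load_rules(xml_text):
    """Parse the rule file into (label, compiled pattern) pairs, in file order."""
    root = ET.fromstring(xml_text.strip())
    return [
        (rule.get("label"), re.compile(rule.get("pattern")))
        for rule in root.findall("rule")
    ]


def label_for(text, rules):
    """Return the label of the first rule whose pattern matches, else None."""
    for label, pattern in rules:
        if pattern.search(text):
            return label
    return None
```

With rules kept in a file like this, adding or adjusting a rule requires no change to the program that applies them.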
Preferably, the method further comprises: converting the version of the document into a preset version. At present, even documents of the same format often cannot be processed together because their versions differ; WORD 2003 and WORD 2007, for example, differ considerably. Software is generally backward compatible, that is, a higher version can handle documents from a lower version. The documents of each format can therefore be converted in advance to the highest version of that format.
Preferably, the objects comprise at least one of: characters, graphics, images, formulas and tables. These are all commonly used objects, so by handling them this preferred embodiment can be applied to most scenarios.
Fig. 2 shows a flowchart of a method for extracting document structure according to a preferred embodiment of the invention, comprising the following steps:
(1) Preprocess the document to be processed: collect the object data embedded in the document, such as characters, graphics, images, formulas and tables, together with the object data externally linked by the document, and store it classified and numbered. The document version may also be normalized by converting different versions of the same document type into a single version. For example, commonly used office software produces documents whose version depends on the software version; to simplify processing, lower-version documents are converted into higher-version documents.
(2) Standardize the preprocessed document data: convert object data such as text, graphics, formulas and tables from their different data standards into data in the predefined standard format. The standard format, the marking rules of the automatic marking device, and the result file of the export device are defined in XML format.
(3) Automatically identify the preprocessed document data: identify the elements specific to documents, such as the table of contents, type page, headers, footers, titles, footnotes, endnotes and page numbers.
(4) Automatically mark the identified document data: under pre-established marking rules, mark the qualifying data in the document. The device that performs automatic marking may use tags and content controls. The marking rules may be defined in XML format.
(5) Interactively mark the automatically identified data: provide an interactive interface for correcting unsatisfactory automatic marking results caused by ambiguity in the rules, and for adding supplementary data beyond the document's own content.
(6) Extract and export the data: extract the marked data and the supplementary data, and export them to generate a predefined result file. The format of the result file may be defined in XML format.
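The export in step (6) can be sketched as follows. The result-file layout (a `<document>` root with `<item>` and `<supplement>` children) is an assumption made for illustration, since the patent states only that the result file may be defined in XML; the serialization uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def export_result(marked_items, supplementary=None):
    """Serialize marked items and supplementary data to an XML string.

    marked_items:  list of (label, content) pairs produced by marking
    supplementary: optional {name: value} data added during interactive marking
    """
    root = ET.Element("document")  # hypothetical root element name
    for label, content in marked_items:
        item = ET.SubElement(root, "item", {"label": label})
        item.text = content
    for name, value in (supplementary or {}).items():
        extra = ET.SubElement(root, "supplement", {"name": name})
        extra.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = export_result([("title", "Chapter 1")], {"editor": "reviewed"})
```

Unmarked content never reaches the result file in this sketch, which mirrors the step's wording that only the marked data and the supplementary data are extracted.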
On the basis of a defined, unified output format, this preferred embodiment standardizes document processing and passes the document through a series of stages (preprocessing, standardization, automatic identification, automatic marking, interactive marking, and extraction and export), so that the same tool and system can structure documents of multiple formats. This increases processing speed and standardizes the output document format.
Fig. 3 shows a schematic diagram of a device for extracting document structure according to an embodiment of the invention, comprising: an acquisition module 10, configured to acquire objects of a document; a conversion module, configured to convert the objects into a predefined standard format; a marking module, configured to identify and mark items in the objects in the standard format; and an extraction module, configured to extract the content of each matched item to organize it into structured data about the document.
This device can structure documents of multiple formats, which increases processing speed, standardizes the output document format, and reduces human error.
Preferably, the acquisition module 10 acquires the objects embedded in the document and the objects externally linked by the document. This preferred embodiment ensures that no object is omitted.
As can be seen from the above description, the above embodiments of the invention allow the same tool and system to structure documents of multiple formats, which increases processing speed, standardizes the output document format, and reduces human error.
Obviously, those skilled in the art should understand that each of the above modules or steps of the present invention can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not restricted to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. For a person skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A method for extracting document structure, characterized by comprising:
acquiring objects of a document;
converting the objects into a predefined standard format;
identifying and marking items in the objects in the standard format; and
extracting the content of each matched item to organize it into structured data about the document.
2. The method according to claim 1, characterized in that acquiring the objects of the document comprises:
acquiring the objects embedded in the document and the objects externally linked by the document.
3. The method according to claim 1, characterized in that identifying and marking the items in the objects in the standard format comprises:
matching the objects in the standard format against preset matching rules to identify the items that satisfy the matching rules; and
marking the items accordingly under preset marking rules.
4. The method according to claim 3, characterized in that identifying and marking the items in the objects in the standard format further comprises:
providing an interface to accept modifications to the marks.
5. The method according to claim 3, characterized in that the marking is performed using tags and/or content controls.
6. The method according to claim 5, characterized in that the matching rules and the marking rules are defined in XML format.
7. The method according to claim 1, characterized by further comprising:
converting the version of the document into a preset version.
8. The method according to any one of claims 1 to 7, characterized in that the objects comprise at least one of: characters, graphics, images, formulas and tables.
9. A device for extracting document structure, characterized by comprising:
an acquisition module, configured to acquire objects of a document;
a conversion module, configured to convert the objects into a predefined standard format;
a marking module, configured to identify and mark items in the objects in the standard format; and
an extraction module, configured to extract the content of each matched item to organize it into structured data about the document.
10. The device according to claim 9, characterized in that the acquisition module acquires the objects embedded in the document and the objects externally linked by the document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011101799728A CN102855243A (en) | 2011-06-28 | 2011-06-28 | Method and device for extracting document structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011101799728A CN102855243A (en) | 2011-06-28 | 2011-06-28 | Method and device for extracting document structure |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102855243A true CN102855243A (en) | 2013-01-02 |
Family
ID=47401836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011101799728A Pending CN102855243A (en) | 2011-06-28 | 2011-06-28 | Method and device for extracting document structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102855243A (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104090920A (en) * | 2014-06-17 | 2014-10-08 | 安徽教育网络出版有限公司 | System for realizing digital content cross-terminal publishing |
CN104424252A (en) * | 2013-08-28 | 2015-03-18 | 北大方正集团有限公司 | Verbal information processing method based on extensive markup language and verbal content server |
CN105701073A (en) * | 2015-12-31 | 2016-06-22 | 北京中科江南信息技术股份有限公司 | Layout file generation method and device |
CN106202229A (en) * | 2016-06-30 | 2016-12-07 | 广州市皓轩软件科技有限公司 | A kind of structural data extraction method for cardiac pacemaker |
CN106484663A (en) * | 2016-10-12 | 2017-03-08 | 天闻数媒科技(湖南)有限公司 | A kind of extracting method of document content and device |
CN106776515A (en) * | 2016-12-16 | 2017-05-31 | 刘立 | The method and device of data processing |
CN107273555A (en) * | 2017-08-18 | 2017-10-20 | 郑州云海信息技术有限公司 | A kind of document information extraction element and method |
CN107340946A (en) * | 2017-06-16 | 2017-11-10 | 贵州广思信息网络有限公司 | The method of content control is managed collectively under a kind of big document |
WO2019237540A1 (en) * | 2018-06-12 | 2019-12-19 | 平安科技(深圳)有限公司 | Method and device for acquiring financial data, terminal device, and medium |
CN111125441A (en) * | 2019-11-08 | 2020-05-08 | 广东电网有限责任公司 | Xml file information processing system |
CN111199143A (en) * | 2018-10-31 | 2020-05-26 | 北大方正集团有限公司 | Indexing method, device and equipment of Word thesis and storage medium |
CN111259202A (en) * | 2020-01-10 | 2020-06-09 | 西宁宁光工程咨询有限公司 | Document structured data embedding method and system |
CN111382621A (en) * | 2018-12-28 | 2020-07-07 | 北大方正集团有限公司 | Parameter adjusting method and device |
CN112528602A (en) * | 2020-07-28 | 2021-03-19 | 浙江明度智控科技有限公司 | Method, system and storage medium for analyzing structured content of medical document |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101350008A (en) * | 2008-08-05 | 2009-01-21 | 深圳市蓝韵实业有限公司 | Method for switching and sharing isomerization medical information data |
CN101430714A (en) * | 2008-12-08 | 2009-05-13 | 北大方正集团有限公司 | Content structuring process method and system based on model |
-
2011
- 2011-06-28 CN CN2011101799728A patent/CN102855243A/en active Pending
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104424252B (en) * | 2013-08-28 | 2017-12-15 | 北大方正集团有限公司 | Literal information processing method and word content server based on XML |
CN104424252A (en) * | 2013-08-28 | 2015-03-18 | 北大方正集团有限公司 | Verbal information processing method based on extensive markup language and verbal content server |
CN104090920A (en) * | 2014-06-17 | 2014-10-08 | 安徽教育网络出版有限公司 | System for realizing digital content cross-terminal publishing |
CN105701073A (en) * | 2015-12-31 | 2016-06-22 | 北京中科江南信息技术股份有限公司 | Layout file generation method and device |
CN106202229A (en) * | 2016-06-30 | 2016-12-07 | 广州市皓轩软件科技有限公司 | A kind of structural data extraction method for cardiac pacemaker |
CN106484663A (en) * | 2016-10-12 | 2017-03-08 | 天闻数媒科技(湖南)有限公司 | A kind of extracting method of document content and device |
CN106484663B (en) * | 2016-10-12 | 2019-05-03 | 天闻数媒科技(湖南)有限公司 | A kind of extracting method and device of document content |
CN106776515A (en) * | 2016-12-16 | 2017-05-31 | 刘立 | The method and device of data processing |
CN106776515B (en) * | 2016-12-16 | 2020-02-18 | 刘立 | Data processing method and device |
CN107340946A (en) * | 2017-06-16 | 2017-11-10 | 贵州广思信息网络有限公司 | The method of content control is managed collectively under a kind of big document |
CN107273555A (en) * | 2017-08-18 | 2017-10-20 | 郑州云海信息技术有限公司 | A kind of document information extraction element and method |
WO2019237540A1 (en) * | 2018-06-12 | 2019-12-19 | 平安科技(深圳)有限公司 | Method and device for acquiring financial data, terminal device, and medium |
CN111199143A (en) * | 2018-10-31 | 2020-05-26 | 北大方正集团有限公司 | Indexing method, device and equipment of Word thesis and storage medium |
CN111382621A (en) * | 2018-12-28 | 2020-07-07 | 北大方正集团有限公司 | Parameter adjusting method and device |
CN111125441A (en) * | 2019-11-08 | 2020-05-08 | 广东电网有限责任公司 | Xml file information processing system |
CN111259202A (en) * | 2020-01-10 | 2020-06-09 | 西宁宁光工程咨询有限公司 | Document structured data embedding method and system |
CN111259202B (en) * | 2020-01-10 | 2023-08-04 | 西宁宁光工程咨询有限公司 | Document structured data embedding method and system |
CN112528602A (en) * | 2020-07-28 | 2021-03-19 | 浙江明度智控科技有限公司 | Method, system and storage medium for analyzing structured content of medical document |
CN112528602B (en) * | 2020-07-28 | 2021-05-04 | 浙江明度智控科技有限公司 | Method, system and storage medium for analyzing structured content of medical document |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102855243A (en) | Method and device for extracting document structure | |
US9081412B2 (en) | System and method for using paper as an interface to computer applications | |
US8892995B2 (en) | Method and system for specialty imaging effect generation using multiple layers in documents | |
CN1892642A (en) | Method and system for processing forms | |
CN101196886B (en) | System and method for converting word files into XML files | |
US20140244668A1 (en) | Sorting and Filtering a Table with Image Data and Symbolic Data in a Single Cell | |
US20060242549A1 (en) | Method, computer programme product and device for the processing of a document data stream from an input format to an output format | |
WO2010122429A3 (en) | Image-based data management method and system | |
CN101430714B (en) | Content structuring process method and system based on model | |
US20140169665A1 (en) | Automated Processing of Documents | |
CN110413740B (en) | Query method and device of chemical expression, electronic equipment and storage medium | |
CN111753717A (en) | Method, apparatus, device and medium for extracting structured information of text | |
CN102279847A (en) | Method and device for internationalizing software system | |
CN1763748A (en) | Electronic filing system and electronic filing method | |
CN101008940B (en) | Method and device for automatic processing font missing | |
CN108363943A (en) | Clearance robot based on Weigh sensor technology | |
US20070116363A1 (en) | Image processing device, image processing method, and storage medium storing image processing program | |
CN110675121A (en) | Method for collecting picture type file material | |
US8200009B2 (en) | Control of optical character recognition (OCR) processes to generate user controllable final output documents | |
CN106780302A (en) | A kind of digital picture automatic keyline layout method and device | |
CN101320453B (en) | Electronic official document circulation automatization method based on Web service | |
CN102629244B (en) | Multi-language work card generating system and method | |
US20100023517A1 (en) | Method and system for extracting data-points from a data file | |
CN116009793A (en) | Printing method | |
US8799762B1 (en) | Generating forms from user-defined information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20130102 |