CN102855243A - Method and device for extracting document structure - Google Patents

Publication number
CN102855243A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101799728A
Other languages
Chinese (zh)
Inventor
曲刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2011-06-28
Filing date
2011-06-28
Publication date
2013-01-02
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN2011101799728A
Publication of CN102855243A
Current legal status: Pending

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention provides a method for extracting a document structure. The method comprises the following steps: acquiring an object of a document; converting the object into a predefined standard format; identifying and marking items in the object in the standard format; and extracting contents of the matched items to form structural data relevant to the document. The invention also provides a device for extracting the document structure. The device comprises an acquisition module for acquiring the object of the document, a conversion module for converting the object into the predefined standard format, a marking module for identifying and marking the items in the object in the standard format, and an extraction module for extracting contents of the matched items to form structural data relevant to the document. By the method and the device, an effect of improving the efficiency for document structure extraction is achieved.

Description

Method and device for extracting document structure
Technical field
The present invention relates to the field of digital publishing, and in particular to a method and a device for extracting document structure.
Background art
In traditional publishing, the document formats of books, newspapers, and periodicals exist only to meet the needs of traditional printing. Their description of content is limited to visual elements such as the outline, color, and position of text, graphics, and images, and does not capture the logical content of the document or its internal relationships. In digital publishing, more attention is paid to the logical content of a document, the relationships within it, and the granularity of its content, and structuring a document is a precondition for reusing digital content.
At present, document content is structured mainly by hand: following predefined rules, an operator visually identifies the content in the document that satisfies the rules and manually enters it into a custom form. This way of working is inefficient, labor-intensive, and error-prone.
Another solution uses a computer to apply preset matching rules to identify the document structure. The inventor has found that, because many document formats are in common use, existing solutions apply a different processing method and system to each document format, which makes operation cumbersome.
Summary of the invention
The present invention aims to provide a method and a device for extracting document structure, so as to solve the problem that operation in the related art is cumbersome.
An embodiment of the present invention provides a method for extracting document structure, comprising: acquiring an object of a document; converting the object into a predefined standard format; identifying and marking items in the object in the standard format; and extracting the content of the matched items to organize it into structured data about the document.
An embodiment of the present invention provides a device for extracting document structure, comprising: an acquisition module, configured to acquire an object of a document; a conversion module, configured to convert the object into a predefined standard format; a marking module, configured to identify and mark items in the object in the standard format; and an extraction module, configured to extract the content of the matched items to organize it into structured data about the document.
In the method and device of the above embodiments, the format of the objects is unified in advance, so data items can be identified automatically. This solves the problem that operation in the related art is cumbersome and improves the efficiency of extracting document structure.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their descriptions serve to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for extracting document structure according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for extracting document structure according to a preferred embodiment of the present invention;
Fig. 3 is a schematic diagram of a device for extracting document structure according to an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 is a flowchart of a method for extracting document structure according to an embodiment of the present invention. The method comprises:
Step S10: acquiring an object of a document;
Step S20: converting the object into a predefined standard format;
Step S30: identifying and marking items in the object in the standard format;
Step S40: extracting the content of the matched items to organize it into structured data about the document.
Commonly used electronic documents come in various formats such as PDF and WORD. Existing document structure recognition techniques cannot handle objects from documents of different formats at the same time, so a different processing method and system must be used for each document format, which is cumbersome, labor-intensive, and error-prone. In this embodiment, by contrast, the format of the objects is unified in advance. With a unified output format defined and the document processing procedure standardized, documents of multiple formats pass through several processing stages and can be structured with the same tools and system. This increases processing speed, standardizes the output document format, and reduces human error.
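As an illustration only, the four steps S10 to S40 can be sketched as a small Python pipeline. The dictionary layout of the input document, the `StandardObject` class, and all function names are assumptions made for this sketch; the patent does not prescribe a data model or an implementation language.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StandardObject:
    """One document object after conversion to a (hypothetical) standard format."""
    kind: str                     # e.g. "text", "image", "table"
    text: str
    marks: list = field(default_factory=list)


def acquire_objects(document):
    # S10: collect the raw objects of the document.
    return list(document["objects"])


def to_standard_format(raw):
    # S20: normalize every raw object to one shared representation.
    return StandardObject(kind=raw.get("type", "text").lower(),
                          text=str(raw.get("content", "")).strip())


def identify_and_mark(obj, rules):
    # S30: mark the object with the name of every rule whose pattern matches.
    for name, pattern in rules.items():
        if re.search(pattern, obj.text):
            obj.marks.append(name)
    return obj


def extract(objects):
    # S40: keep only the marked items and organize them as structured data.
    return [{"item": m, "content": o.text} for o in objects for m in o.marks]


def extract_structure(document, rules):
    standard = [to_standard_format(r) for r in acquire_objects(document)]
    return extract([identify_and_mark(o, rules) for o in standard])
```

Because every object passes through `to_standard_format` first, the same rules can be applied regardless of the source format, which is the point the paragraph above makes.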
Preferably, step S10 comprises: acquiring the objects embedded in the document and the objects externally linked from the document. Existing document formats are fairly complex: an object may be embedded in the document, or the document may contain only the link address of the object. By acquiring both embedded objects and externally linked objects, this preferred embodiment ensures that no object is missed.
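A minimal sketch of this acquisition step, assuming a document is represented as a dictionary whose entries carry either inline `data` or a `link` address. The `fetch_linked` callback is a hypothetical stand-in for whatever mechanism resolves a link address; neither name comes from the patent.

```python
def collect_objects(document, fetch_linked):
    """Return embedded objects plus objects resolved from external links."""
    collected = []
    for entry in document["objects"]:
        if "link" in entry:
            # Externally linked: the document stores only an address,
            # so the object itself has to be fetched.
            collected.append({"source": "linked",
                              "data": fetch_linked(entry["link"])})
        else:
            # Embedded: the object data is stored in the document itself.
            collected.append({"source": "embedded", "data": entry["data"]})
    return collected
```

Passing the resolver in as a parameter keeps the sketch independent of any particular storage or network layer.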
Preferably, step S30 comprises: matching the object in the standard format against preset matching rules to identify the items that satisfy the matching rules; and marking each identified item according to preset marking rules. Because the matching rules and marking rules are defined explicitly, they can easily be executed by a computer program, which automates the operation.
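The split between rules that identify an item and rules that decide how it is marked can be sketched as follows. The regular expressions, the item names, and the tag names are illustrative assumptions; the patent does not specify what its rules look like.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical matching rules: item type -> pattern that identifies it.
MATCHING_RULES = {
    "heading": re.compile(r"^\d+(\.\d+)*\s+\S"),
    "page_number": re.compile(r"^-?\s*\d+\s*-?$"),
}

# Hypothetical marking rules: item type -> tag used to mark it.
MARKING_RULES = {"heading": "title", "page_number": "pageno"}


def identify(text):
    """Return the first item type whose matching rule fits, or None."""
    for item_type, pattern in MATCHING_RULES.items():
        if pattern.search(text):
            return item_type
    return None


def mark(text):
    """Wrap text in the tag given by the marking rule, if any rule matched."""
    item_type = identify(text)
    if item_type is None:
        return text
    tag = MARKING_RULES[item_type]
    return f"<{tag}>{escape(text)}</{tag}>"
```

Keeping the two rule tables separate means the way an item is marked can change without touching the logic that recognizes it.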
Preferably, step S30 further comprises: providing an interface for accepting modifications to the marks. Because document content is highly complex, the content that the computer identifies and marks automatically may be inaccurate. By providing a human-computer interaction interface, this preferred embodiment allows errors to be corrected and missing marks to be added manually.
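One simple way to model such an interface is a function that takes the automatic result together with the corrections an operator entered. The `(text, mark)` pair layout and the `corrections` mapping are assumptions for this sketch, not something the patent defines.

```python
def apply_corrections(marked, corrections):
    """Apply operator corrections to automatically assigned marks.

    marked:      list of (text, mark) pairs from automatic marking;
                 mark is None for unmarked text.
    corrections: {index: new_mark}; a new_mark of None removes a mark.
    """
    result = []
    for i, (text, mark) in enumerate(marked):
        # Keep the automatic mark unless the operator supplied a correction.
        result.append((text, corrections.get(i, mark)))
    return result
```

The automatic result is left untouched and a corrected copy is returned, so the original marks stay available for review.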
Preferably, the marking is performed with tags and/or content controls. These are commonly used marking methods and are easy to implement.
Preferably, the matching rules and marking rules are defined in XML format. XML is a standard structured language, so defining the matching rules and marking rules in it is relatively easy to implement.
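The patent does not publish its rule schema, so the XML layout below (a `rules` root with `rule` elements carrying `item`, `tag`, and `pattern` attributes) is purely hypothetical. The sketch only shows that rules kept in XML can be loaded and applied with Python's standard `xml.etree.ElementTree` and `re` modules.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file: each rule pairs a matching pattern with a marking tag.
RULES_XML = r"""
<rules>
  <rule item="heading" tag="title" pattern="^Chapter \d+"/>
  <rule item="footnote" tag="footnote" pattern="^\[\d+\]"/>
</rules>
"""


def load_rules(xml_text):
    """Parse the rule file into (item, tag, compiled pattern) tuples."""
    root = ET.fromstring(xml_text.strip())
    return [(node.get("item"), node.get("tag"), re.compile(node.get("pattern")))
            for node in root.iter("rule")]


def tag_for(line, rules):
    """Return the tag of the first rule that matches the line, or None."""
    for _item, tag, pattern in rules:
        if pattern.search(line):
            return tag
    return None
```

With the rules held in a data file rather than in code, changing what is recognized only requires editing the XML.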
Preferably, the method further comprises: converting the document to a preset version. At present, even documents of the same format often cannot be processed together because their versions differ; WORD 2003 and WORD 2007, for example, differ considerably. Software is commonly backward compatible, that is, a newer version can handle documents produced by older versions. The documents of each format can therefore be converted in advance to the newest version of that format.
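A sketch of this normalization step, assuming each document records its format and version in a dictionary. The `TARGET_VERSIONS` table and the `convert` callback are hypothetical; the actual conversion would be done by a format-specific tool that the patent does not name.

```python
# Hypothetical preset versions, one per format. The patent only states that
# every document of a format is converted to one version chosen in advance.
TARGET_VERSIONS = {"word": 2007, "pdf": 1.7}


def normalize_version(doc, convert):
    """Return doc unchanged if it is already at the preset version,
    otherwise hand it to the caller-supplied convert(doc, target)."""
    target = TARGET_VERSIONS[doc["format"]]
    if doc["version"] == target:
        return doc
    return convert(doc, target)
```

After this step, later stages only ever see one version per format.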
Preferably, the object comprises at least one of the following: characters, graphics, images, formulas, and tables. These are all commonly used objects, so by processing them this preferred embodiment can be applied to most scenarios.
Fig. 2 is a flowchart of a method for extracting document structure according to a preferred embodiment of the present invention. The method comprises the following steps:
(1) Preprocess the document to be processed: collect the object data embedded in the document, such as characters, graphics, images, formulas, and tables, together with the object data externally linked from the document, then classify, number, and store it. The document version can also be normalized so that different versions of the same document type are handled as a single version. For example, commonly used office software produces documents whose version depends on the software version; to simplify processing, documents of older versions are converted to the newest version.
(2) Standardize the preprocessed data: convert object data that follows different data standards, such as text, graphics, formulas, and tables, into data in the predefined standard format. The standard format, the marking rules used by the automatic marking device, and the result file of the export device are all defined in XML format.
(3) Automatically identify the preprocessed document data: identify elements specific to the document, such as the table of contents, type page, headers, footers, titles, footnotes, endnotes, and page numbers.
(4) Automatically mark the identified document data: mark the data in the document that satisfies the pre-established marking rules. The device that performs automatic marking can use tags and content controls. The marking rules can be defined in XML format.
(5) Interactively mark the automatically processed data: provide an interactive interface for correcting unsatisfactory marking results caused by ambiguity in the rules, and for adding supplementary data that lies outside the content of the document itself.
(6) Extract and export the data: extract the marked data and the supplementary data, and export them to generate a predefined result file. The format of the result file can be defined in XML.
This preferred embodiment defines a unified output format and standardizes the document processing procedure. Through several processing stages (preprocessing, standardization, automatic identification, automatic marking, interactive marking, and extraction and export), documents of multiple formats can be structured with the same tools and system, which increases processing speed and standardizes the output document format.
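The final extraction and export stage can be sketched with Python's standard `xml.etree.ElementTree` module. The element names `document`, `supplementary`, and `items` are assumptions, since the patent says only that the result file format can be defined in XML.

```python
import xml.etree.ElementTree as ET


def export_result(marked_items, supplementary):
    """Build an XML result from marked items and supplementary data.

    marked_items:  list of (tag, content) pairs produced by marking.
    supplementary: dict of extra data added during interactive marking;
                   keys are assumed to be valid XML element names.
    """
    root = ET.Element("document")
    meta = ET.SubElement(root, "supplementary")
    for key, value in supplementary.items():
        ET.SubElement(meta, key).text = value
    body = ET.SubElement(root, "items")
    for tag, content in marked_items:
        ET.SubElement(body, tag).text = content
    return ET.tostring(root, encoding="unicode")
```

Building the tree through `ElementTree` rather than by string concatenation lets the library escape special characters in the content.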
Fig. 3 is a schematic diagram of a device for extracting document structure according to an embodiment of the present invention. The device comprises:
Acquisition module 10, configured to acquire an object of a document;
Conversion module 20, configured to convert the object into a predefined standard format;
Marking module 30, configured to identify and mark items in the object in the standard format;
Extraction module 40, configured to extract the content of the matched items to organize it into structured data about the document.
The device can structure documents of multiple formats, which increases processing speed, standardizes the output document format, and reduces human error.
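In software, the four modules could be composed as in the following sketch. The class name `ExtractionDevice` and the idea of passing each module in as a callable are assumptions made for illustration; the patent leaves the realization open, including hardware implementations.

```python
class ExtractionDevice:
    """Chains four pluggable modules in the order shown in Fig. 3."""

    def __init__(self, acquire, convert, mark, extract):
        self.acquire = acquire    # module 10: document -> raw objects
        self.convert = convert    # module 20: raw object -> standard format
        self.mark = mark          # module 30: standard object -> marked object
        self.extract = extract    # module 40: marked objects -> structured data

    def run(self, document):
        objects = self.acquire(document)
        standard = [self.convert(o) for o in objects]
        marked = [self.mark(o) for o in standard]
        return self.extract(marked)
```

Supporting another input format then only means supplying different `acquire` and `convert` callables; the marking and extraction modules stay the same.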
Preferably, the acquisition module 10 acquires the objects embedded in the document and the objects externally linked from the document. This preferred embodiment ensures that no object is missed.
As can be seen from the above description, the above embodiments of the present invention make it possible to structure documents of multiple formats with the same tools and system, which increases processing speed, standardizes the output document format, and reduces human error.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for extracting document structure, characterized by comprising:
acquiring an object of a document;
converting the object into a predefined standard format;
identifying and marking items in the object in the standard format;
extracting the content of the matched items to organize it into structured data about the document.
2. The method according to claim 1, characterized in that acquiring the object of the document comprises:
acquiring the objects embedded in the document and the objects externally linked from the document.
3. The method according to claim 1, characterized in that identifying and marking items in the object in the standard format comprises:
matching the object in the standard format against preset matching rules to identify the items that satisfy the matching rules;
marking the items correspondingly according to preset marking rules.
4. The method according to claim 3, characterized in that identifying and marking items in the object in the standard format further comprises:
providing an interface for accepting modifications to the marks.
5. The method according to claim 3, characterized in that the marking is performed with tags and/or content controls.
6. The method according to claim 5, characterized in that the matching rules and the marking rules are defined in XML format.
7. The method according to claim 1, characterized by further comprising:
converting the document to a preset version.
8. The method according to any one of claims 1 to 7, characterized in that the object comprises at least one of the following: characters, graphics, images, formulas, and tables.
9. A device for extracting document structure, characterized by comprising:
an acquisition module, configured to acquire an object of a document;
a conversion module, configured to convert the object into a predefined standard format;
a marking module, configured to identify and mark items in the object in the standard format;
an extraction module, configured to extract the content of the matched items to organize it into structured data about the document.
10. The device according to claim 9, characterized in that the acquisition module acquires the objects embedded in the document and the objects externally linked from the document.
CN2011101799728A 2011-06-28 2011-06-28 Method and device for extracting document structure Pending CN102855243A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101799728A CN102855243A (en) 2011-06-28 2011-06-28 Method and device for extracting document structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011101799728A CN102855243A (en) 2011-06-28 2011-06-28 Method and device for extracting document structure

Publications (1)

Publication Number Publication Date
CN102855243A true CN102855243A (en) 2013-01-02

Family

ID=47401836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101799728A Pending CN102855243A (en) 2011-06-28 2011-06-28 Method and device for extracting document structure

Country Status (1)

Country Link
CN (1) CN102855243A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090920A (en) * 2014-06-17 2014-10-08 安徽教育网络出版有限公司 System for realizing digital content cross-terminal publishing
CN104424252A (en) * 2013-08-28 2015-03-18 北大方正集团有限公司 Verbal information processing method based on extensive markup language and verbal content server
CN105701073A (en) * 2015-12-31 2016-06-22 北京中科江南信息技术股份有限公司 Layout file generation method and device
CN106202229A (en) * 2016-06-30 2016-12-07 广州市皓轩软件科技有限公司 A kind of structural data extraction method for cardiac pacemaker
CN106484663A (en) * 2016-10-12 2017-03-08 天闻数媒科技(湖南)有限公司 A kind of extracting method of document content and device
CN106776515A (en) * 2016-12-16 2017-05-31 刘立 The method and device of data processing
CN107273555A (en) * 2017-08-18 2017-10-20 郑州云海信息技术有限公司 A kind of document information extraction element and method
CN107340946A (en) * 2017-06-16 2017-11-10 贵州广思信息网络有限公司 The method of content control is managed collectively under a kind of big document
WO2019237540A1 (en) * 2018-06-12 2019-12-19 平安科技(深圳)有限公司 Method and device for acquiring financial data, terminal device, and medium
CN111125441A (en) * 2019-11-08 2020-05-08 广东电网有限责任公司 Xml file information processing system
CN111199143A (en) * 2018-10-31 2020-05-26 北大方正集团有限公司 Indexing method, device and equipment of Word thesis and storage medium
CN111259202A (en) * 2020-01-10 2020-06-09 西宁宁光工程咨询有限公司 Document structured data embedding method and system
CN111382621A (en) * 2018-12-28 2020-07-07 北大方正集团有限公司 Parameter adjusting method and device
CN112528602A (en) * 2020-07-28 2021-03-19 浙江明度智控科技有限公司 Method, system and storage medium for analyzing structured content of medical document

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350008A (en) * 2008-08-05 2009-01-21 深圳市蓝韵实业有限公司 Method for switching and sharing isomerization medical information data
CN101430714A (en) * 2008-12-08 2009-05-13 北大方正集团有限公司 Content structuring process method and system based on model

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424252B (en) * 2013-08-28 2017-12-15 北大方正集团有限公司 Literal information processing method and word content server based on XML
CN104424252A (en) * 2013-08-28 2015-03-18 北大方正集团有限公司 Verbal information processing method based on extensive markup language and verbal content server
CN104090920A (en) * 2014-06-17 2014-10-08 安徽教育网络出版有限公司 System for realizing digital content cross-terminal publishing
CN105701073A (en) * 2015-12-31 2016-06-22 北京中科江南信息技术股份有限公司 Layout file generation method and device
CN106202229A (en) * 2016-06-30 2016-12-07 广州市皓轩软件科技有限公司 A kind of structural data extraction method for cardiac pacemaker
CN106484663A (en) * 2016-10-12 2017-03-08 天闻数媒科技(湖南)有限公司 A kind of extracting method of document content and device
CN106484663B (en) * 2016-10-12 2019-05-03 天闻数媒科技(湖南)有限公司 A kind of extracting method and device of document content
CN106776515A (en) * 2016-12-16 2017-05-31 刘立 The method and device of data processing
CN106776515B (en) * 2016-12-16 2020-02-18 刘立 Data processing method and device
CN107340946A (en) * 2017-06-16 2017-11-10 贵州广思信息网络有限公司 The method of content control is managed collectively under a kind of big document
CN107273555A (en) * 2017-08-18 2017-10-20 郑州云海信息技术有限公司 A kind of document information extraction element and method
WO2019237540A1 (en) * 2018-06-12 2019-12-19 平安科技(深圳)有限公司 Method and device for acquiring financial data, terminal device, and medium
CN111199143A (en) * 2018-10-31 2020-05-26 北大方正集团有限公司 Indexing method, device and equipment of Word thesis and storage medium
CN111382621A (en) * 2018-12-28 2020-07-07 北大方正集团有限公司 Parameter adjusting method and device
CN111125441A (en) * 2019-11-08 2020-05-08 广东电网有限责任公司 Xml file information processing system
CN111259202A (en) * 2020-01-10 2020-06-09 西宁宁光工程咨询有限公司 Document structured data embedding method and system
CN111259202B (en) * 2020-01-10 2023-08-04 西宁宁光工程咨询有限公司 Document structured data embedding method and system
CN112528602A (en) * 2020-07-28 2021-03-19 浙江明度智控科技有限公司 Method, system and storage medium for analyzing structured content of medical document
CN112528602B (en) * 2020-07-28 2021-05-04 浙江明度智控科技有限公司 Method, system and storage medium for analyzing structured content of medical document

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2013-01-02