CN102855243A - Method and device for extracting document structure

Info

Publication number: CN102855243A
Application number: CN2011101799728A
Authority: CN
Inventors: 曲刚
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Priority date: 2011-06-28
Filing date: 2011-06-28
Publication date: 2013-01-02

Abstract

The invention provides a method for extracting a document structure. The method comprises the following steps: acquiring an object of a document; converting the object into a predefined standard format; identifying and marking items in the object in the standard format; and extracting contents of the matched items to form structural data relevant to the document. The invention also provides a device for extracting the document structure. The device comprises an acquisition module for acquiring the object of the document, a conversion module for converting the object into the predefined standard format, a marking module for identifying and marking the items in the object in the standard format, and an extraction module for extracting contents of the matched items to form structural data relevant to the document. By the method and the device, an effect of improving the efficiency for document structure extraction is achieved.

Description

Method and device for extracting document structure

Technical field

The present invention relates to the field of digital publishing, and in particular to a method and device for extracting document structure.

Background technology

In traditional publishing, the document formats of books, newspapers and periodicals serve only the needs of traditional printing. Their description of content is confined to visual elements such as the outline, color and position of text, graphics and images, and does not capture the logical content of the document or its internal relationships. Digital publishing pays more attention to the logical content of a document, the relationships among its parts, and the granularity of the content. Structuring the document is therefore a precondition for reusing digital content.

At present, document content is structured mainly by hand: operators visually identify the content in the document that satisfies predefined rules and fill it into a custom form manually. This way of working is inefficient, labor-intensive and error-prone.

Another solution is to have a computer identify the document structure using preset matching rules. The inventor found that, because many document formats are in common use, current solutions apply a different processing method and system to each document format, which makes the operation cumbersome.

Summary of the invention

The present invention aims to provide a method and device for extracting document structure, so as to solve the problem that the related art is cumbersome to operate.

An embodiment of the present invention provides a method for extracting document structure, comprising: acquiring an object of a document; converting the object into a predefined standard format; identifying and marking items in the object in the standard format; and extracting the content of the matched items to form structured data about the document.

An embodiment of the present invention provides a device for extracting document structure, comprising: an acquisition module for acquiring an object of a document; a conversion module for converting the object into a predefined standard format; a marking module for identifying and marking items in the object in the standard format; and an extraction module for extracting the content of the matched items to form structured data about the document.

Because the method and device of the above embodiments unify the format of the objects in advance, data items can be identified automatically. This solves the problem that the related art is cumbersome to operate and improves the efficiency of extracting document structure.

Description of drawings

The drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 shows a flowchart of the method for extracting document structure according to an embodiment of the invention;

Fig. 2 shows a flowchart of the method for extracting document structure according to a preferred embodiment of the invention;

Fig. 3 shows a schematic diagram of the device for extracting document structure according to an embodiment of the invention.

Embodiment

The invention is described in detail below with reference to the drawings and in conjunction with the embodiments.

Fig. 1 shows a flowchart of the method for extracting document structure according to an embodiment of the invention, comprising:

Step S10, acquiring an object of a document;

Step S20, converting the object into a predefined standard format;

Step S30, identifying and marking items in the object in the standard format;

Step S40, extracting the content of the matched items to form structured data about the document.

Commonly used electronic documents come in various formats such as PDF and WORD. Existing document structure recognition techniques cannot identify the objects in documents of different formats at the same time, so a different processing method and system must be used for each document format. The operation is cumbersome, the workload is large, and errors are easy to make. In this embodiment the format of the objects is unified in advance. On the basis of a defined, unified output format, document processing is standardized and passes through several stages, so that one tool and system can structure documents of multiple formats. This increases processing speed, standardizes the output document format, and reduces human error.
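The four steps S10 to S40 can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions: the `DocObject` type, the dictionary shape of `document`, the whitespace normalization used as the "standard format", and the "Chapter " matching rule are all hypothetical and are not specified by the embodiment.

```python
from dataclasses import dataclass


@dataclass
class DocObject:
    kind: str     # hypothetical object type, e.g. "text" or "image"
    payload: str  # content as it appears in the source format


def acquire(document):
    """S10: acquire the objects of the document (assumed to be pre-parsed)."""
    return list(document["objects"])


def to_standard(obj):
    """S20: convert an object into an assumed standard form
    (lower-case kind, whitespace collapsed to single spaces)."""
    return DocObject(obj.kind.lower(), " ".join(obj.payload.split()))


def mark(obj):
    """S30: identify and mark an item; returns a label or None.
    The single rule below is a placeholder for real matching rules."""
    if obj.kind == "text" and obj.payload.startswith("Chapter "):
        return "chapter-title"
    return None


def extract_structure(document):
    """S40: extract the content of matched items as structured data."""
    result = []
    for raw in acquire(document):
        std = to_standard(raw)
        label = mark(std)
        if label is not None:
            result.append({"label": label, "content": std.payload})
    return result
```

Because every object passes through `to_standard` before `mark`, the matching rule is written once, regardless of which source format the object came from.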

Preferably, step S10 comprises: acquiring the objects embedded in the document and the objects linked from outside the document. Existing document formats are fairly complex: an object may be embedded in the document, or the document may contain only the link address of the object. By acquiring both the embedded objects and the externally linked objects, this preferred embodiment ensures that no object is missed.
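As one possible illustration of collecting both kinds of object, the sketch below reads an Office Open XML (`.docx`-style) package, where embedded media are stored as parts under `word/media/` and external links appear in the relationships part with `TargetMode="External"`. The embodiment does not name this format or this approach; the function names and the sample package are assumptions made for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of relationship parts (.rels files) in OOXML packages.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
RELS_PART = "word/_rels/document.xml.rels"


def collect_objects(package_bytes):
    """Return (embedded, external): embedded media part names and
    the targets of externally linked objects."""
    embedded, external = [], []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as pkg:
        names = pkg.namelist()
        embedded = sorted(n for n in names if n.startswith("word/media/"))
        if RELS_PART in names:
            root = ET.fromstring(pkg.read(RELS_PART))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    external.append(rel.get("Target"))
    return embedded, external


def make_sample_package():
    """Build a tiny in-memory package for demonstration only.
    It is not a complete Word file and the Type values are simplified."""
    rels = (
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        '<Relationship Id="rId1" Type="image" Target="media/image1.png"/>'
        '<Relationship Id="rId2" Type="image" '
        'Target="http://example.com/fig2.png" TargetMode="External"/>'
        '</Relationships>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        pkg.writestr("word/document.xml", "<document/>")
        pkg.writestr("word/media/image1.png", b"\x89PNG")
        pkg.writestr(RELS_PART, rels)
    return buf.getvalue()
```

Calling `collect_objects(make_sample_package())` yields one embedded part and one external link target, so neither kind of object is skipped.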

Preferably, step S30 comprises: matching the object in the standard format against preset matching rules to identify the items that satisfy the matching rules; and marking those items accordingly, based on preset marking rules. Because the matching rules and marking rules are set in advance, they can easily be executed by a computer program, which automates the operation.

Preferably, step S30 further comprises: providing an interface for accepting modifications to the marks. Because document content is highly complex, the content that the computer identifies and marks automatically may be inaccurate. By providing a human-computer interaction interface, this preferred embodiment allows errors to be corrected and missing marks to be added manually.

Preferably, labels and/or content controls are used for marking. These are commonly used marking means and are easy to implement.

Preferably, the matching rules and the marking rules are defined in XML. XML is a standard structured language, so defining the matching rules and marking rules in it is relatively easy to implement.
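A minimal sketch of rules defined in XML and applied by a program follows. The rule schema (a `rule` element with `tag` and `pattern` attributes), the use of regular expressions as the matching mechanism, and the function names are assumptions for illustration; the source does not specify the rule format beyond its being XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file: each rule pairs a matching pattern with the
# label to apply when the pattern matches.
RULES_XML = r"""
<rules>
  <rule tag="chapter-title" pattern="^Chapter \d+"/>
  <rule tag="footnote" pattern="^\[\d+\]"/>
</rules>
"""


def load_rules(xml_text):
    """Parse the XML rule definitions into (tag, compiled pattern) pairs."""
    root = ET.fromstring(xml_text.strip())
    return [(r.get("tag"), re.compile(r.get("pattern"))) for r in root.iter("rule")]


def mark_items(paragraphs, rules):
    """Mark each paragraph with the tag of the first rule it matches,
    or None when no rule matches."""
    marked = []
    for text in paragraphs:
        tag = next((t for t, rx in rules if rx.search(text)), None)
        marked.append((tag, text))
    return marked
```

Keeping the rules in a separate XML text means they can be changed without touching the program that applies them.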

Preferably, the method further comprises: converting the version of the document to a version set in advance. At present, even documents of the same format often cannot be processed because their versions differ. For example, WORD 2003 and WORD 2007 differ considerably. Software versions are generally compatible in one direction, that is, a higher version can handle documents of a lower version. The versions of the documents of each format can therefore all be converted in advance to the highest version of that format.
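The version check that precedes such a conversion might be sketched as follows. The `TARGET_VERSION` table, the tuple layout of `documents`, and the function name are hypothetical; the conversion itself depends on the software that produced the document and is not shown.

```python
# Hypothetical preset target version per format; the source only
# speaks of "a version set in advance".
TARGET_VERSION = {"word": 2007}


def plan_conversions(documents):
    """Return the names of documents whose version is below the preset
    target for their format. Each document is (name, format, version)."""
    to_convert = []
    for name, fmt, version in documents:
        target = TARGET_VERSION.get(fmt)
        if target is not None and version < target:
            to_convert.append(name)
    return to_convert
```

Documents already at the target version, and formats with no preset target, are left untouched.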

Preferably, the object comprises at least one of the following: characters, graphics, images, formulas and tables. These are all commonly used objects, so by handling them this preferred embodiment applies to most scenarios.

Fig. 2 shows a flowchart of the method for extracting document structure according to a preferred embodiment of the invention, comprising the following steps:

(1) Preprocess the document to be processed: collect the object data embedded in the document, such as characters, graphics, images, formulas and tables, together with the object data linked from outside the document, and store it classified and numbered. The document version may also be normalized so that different versions of the same document type are processed as one version. For example, documents produced by commonly used office software differ in version because the software versions differ; to simplify processing, lower-version documents are converted to the higher version.

(2) Standardize the preprocessed document data: convert object data such as text, graphics, formulas and tables that follow different data standards into data in the predefined standard format. The marking rules in this standard format, and the formats of the result files of the automatic marking device and of the export device, are defined in XML.

(3) Automatically identify the preprocessed document data, including identifying elements specific to the document, such as the table of contents, type page, headers, footers, titles, footnotes, endnotes and page numbers.

(4) Automatically mark the document data after automatic identification: according to marking rules established in advance, mark the data in the document that satisfy the rules. The device that automatically marks the document may use labels and content controls for marking. The marking rules may be defined in XML.

(5) Interactively mark the data after automatic processing: provide an interactive interface for correcting unsatisfactory marking results that rule ambiguity caused during automatic marking, and for adding to the document supplementary data beyond the document's own content.

(6) Extract and export the data: extract the marked data and the supplementary data, and export them to generate a predefined result file. The format of the result file may be defined in XML.
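Steps (5) and (6) can be illustrated with a short sketch: manual corrections override the automatic marks, supplementary data is attached, and the result is exported as XML. The element names (`document`, `supplementary`, `field`, `items`, `item`), the index-based correction format, and the function names are assumptions for the example; the source only states that the result file format may be defined in XML.

```python
import xml.etree.ElementTree as ET


def apply_corrections(marked, corrections):
    """Step (5): return a copy of the marked items in which the label at
    each index listed in `corrections` is replaced by the corrected label."""
    return [(corrections.get(i, tag), text) for i, (tag, text) in enumerate(marked)]


def export_result(marked, supplementary):
    """Step (6): export marked items and supplementary data as an XML string.
    Items whose label is None are treated as unmarked and are not exported."""
    root = ET.Element("document")
    meta = ET.SubElement(root, "supplementary")
    for key, value in supplementary.items():
        ET.SubElement(meta, "field", name=key).text = value
    items = ET.SubElement(root, "items")
    for tag, text in marked:
        if tag is not None:
            ET.SubElement(items, "item", type=tag).text = text
    return ET.tostring(root, encoding="unicode")
```

In a real tool the `corrections` mapping would come from the interactive interface; here it is simply passed in as a dictionary from item index to corrected label.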

On the basis of a defined, unified output format, this preferred embodiment standardizes document processing and passes the document through several stages (preprocessing, standardization, automatic identification, automatic marking, interactive marking, and extraction and export). One tool and system can thus structure documents of multiple formats, which increases processing speed and standardizes the output document format.

Fig. 3 shows a schematic diagram of the device for extracting document structure according to an embodiment of the invention, comprising:

an acquisition module 10 for acquiring an object of a document;

a conversion module 20 for converting the object into a predefined standard format;

a marking module 30 for identifying and marking items in the object in the standard format;

an extraction module 40 for extracting the content of the matched items to form structured data about the document.
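The module arrangement of Fig. 3 can be sketched as a class that wires four interchangeable components together. The class name, the callable signatures, and the use of plain strings for objects are illustrative assumptions, not part of the described device.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StructureExtractionDevice:
    acquisition: Callable[[dict], List[str]]             # module 10
    conversion: Callable[[str], str]                     # module 20
    marking: Callable[[str], Optional[str]]              # module 30
    extraction: Callable[[List[Tuple[str, str]]], dict]  # module 40

    def run(self, document):
        """Pass the document through modules 10, 20, 30 and 40 in order."""
        standard = [self.conversion(o) for o in self.acquisition(document)]
        marked = [(self.marking(o), o) for o in standard]
        matched = [(label, o) for label, o in marked if label is not None]
        return self.extraction(matched)
```

Because each module is supplied as a separate callable, a format-specific acquisition or conversion component can be swapped in without changing the marking or extraction modules.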

This device can structure documents of multiple formats, which increases processing speed, standardizes the output document format, and reduces human error.

Preferably, the acquisition module 10 acquires the objects embedded in the document and the objects linked from outside the document. This preferred embodiment ensures that no object is missed.

As can be seen from the above description, the above embodiments of the invention allow one tool and system to structure documents of multiple formats, which increases processing speed, standardizes the output document format, and reduces human error.

Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into a separate integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. The invention is thus not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the invention and are not intended to limit it; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for extracting document structure, characterized in that it comprises:

acquiring an object of a document;

converting the object into a predefined standard format;

identifying and marking items in the object in the standard format;

extracting the content of the matched items to form structured data about the document.

2. The method according to claim 1, characterized in that acquiring the object of the document comprises:

acquiring the objects embedded in the document and the objects linked from outside the document.

3. The method according to claim 1, characterized in that identifying and marking items in the object in the standard format comprises:

matching the object in the standard format against preset matching rules to identify the items that satisfy the matching rules;

marking the items accordingly, based on preset marking rules.

4. The method according to claim 3, characterized in that identifying and marking items in the object in the standard format further comprises:

providing an interface for accepting modifications to the marks.

5. The method according to claim 3, characterized in that labels and/or content controls are used for marking.

6. The method according to claim 5, characterized in that the matching rules and the marking rules are defined in XML.

7. The method according to claim 1, characterized in that it further comprises:

converting the version of the document to a version set in advance.

8. The method according to any one of claims 1 to 7, characterized in that the object comprises at least one of the following: characters, graphics, images, formulas and tables.

9. A device for extracting document structure, characterized in that it comprises:

an acquisition module for acquiring an object of a document;

a conversion module for converting the object into a predefined standard format;

a marking module for identifying and marking items in the object in the standard format;

an extraction module for extracting the content of the matched items to form structured data about the document.

10. The device according to claim 9, characterized in that the acquisition module acquires the objects embedded in the document and the objects linked from outside the document.